
<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1058">
  <Title>Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study</Title>
  <Section position="3" start_page="0" end_page="4" type="intro">
    <SectionTitle>
3 An Empirical Study
</SectionTitle>
    <Paragraph position="0"> We evaluated our approach to word sense disambiguation on all the 29 nouns in the English lexical sample task of SENSEVAL-2 (Edmonds and Cotton, 2001; Kilgarriff 2001). The list of 29 nouns is given in Table 3. The second column of Table 3 lists the number of senses of each noun as given in the WordNet 1.7 sense inventory (Miller, 1990).</Paragraph>
    <Paragraph position="1"> We first lump together two senses s  are translated into the same Chinese word. The number of senses of each noun after sense lumping is given in column 3 of Table 3. For the 7 nouns that are lumped into one sense (i.e., they are all translated into one Chinese word), we do not perform WSD on these words. The average number of senses before and after sense lumping is 5.07 and 3.52 respectively.</Paragraph>
    <Paragraph position="2"> After sense lumping, we trained a WSD classifier for each noun w, by using the lumped senses in the manually sense-tagged training data for w provided by the SENSEVAL-2 organizers. We then tested the WSD classifier on the official SENSEVAL-2 test data (but with lumped senses) for w. The test accuracy (based on fine-grained scoring of SENSEVAL-2) of each noun obtained is listed in the column labeled M1 in Table 3.</Paragraph>
    <Paragraph position="3"> We then used our approach of parallel text alignment described in the last section to obtain the training examples from the English side of the parallel texts. Due to the memory size limitation of our machine, we were not able to align all six parallel corpora of 280MB in one alignment run of GIZA++. For two of the corpora, Hong Kong Hansards and Xinhua News, we gathered all English sentences containing the 29 SENSEVAL-2 noun occurrences (and their sentence-aligned Chinese sentence counterparts). This subset, together with the complete corpora of Hong Kong News, Hong Kong Laws, English translation of Chinese Treebank, and Sinorama, is then given to GIZA++ to perform one word alignment run. It took about 40 hours on our 2.4 GHz machine with 2 GB memory to perform this alignment.</Paragraph>
    <Paragraph position="4"> After word alignment, each 3-sentence context in English containing an occurrence of the noun w that is aligned to a selected Chinese translation then forms a training example. For each SENSEVAL-2 noun w, we then collected training examples from the English side of the parallel texts using the same number of training examples for each sense of w that are present in the manually sense-tagged SENSEVAL-2 official training corpus (lumped-sense version). If there are insufficient training examples for some sense of w from the parallel texts, then we just used as many parallel text training examples as we could find for that sense. We chose the same number of training examples for each sense as the official training data so that we can do a fair comparison between the accuracy of the parallel text alignment approach versus the manual sense-tagging approach.</Paragraph>
    <Paragraph position="5"> After training a WSD classifier for w with such parallel text examples, we then evaluated the WSD classifier on the same official SENSEVAL-2 test set (with lumped senses). The test accuracy of each noun obtained by training on such parallel text training examples (averaged over 10 trials) is listed in the column labeled P1 in Table 3.</Paragraph>
    <Paragraph position="6"> The baseline accuracy for each noun is also listed in the column labeled &amp;quot;P1-Baseline&amp;quot; in Table 3. The baseline accuracy corresponds to always picking the most frequently occurring sense in the training data.</Paragraph>
    <Paragraph position="7"> Ideally, we would hope M1 and P1 to be close in value, since this would imply that WSD based on training examples collected from the parallel text alignment approach performs as well as manually sense-tagged training examples. Comparing the M1 and P1 figures, we observed that there is a set of nouns for which they are relatively close.</Paragraph>
    <Paragraph position="8"> These nouns are: bar, bum, chair, day, dyke, fatigue, hearth, mouth, nation, nature, post, restraint, sense, stress. This set of nouns is relatively easy to disambiguate, since using the mostfrequently-occurring-sense baseline would have done well for most of these nouns.</Paragraph>
    <Paragraph position="9"> The parallel text alignment approach works well for nature and sense, among these nouns. For nature, the parallel text alignment approach gives better accuracy, and for sense the accuracy difference is only 0.014 (while there is a relatively large difference of 0.231 between P1 and P1-Baseline of sense). This demonstrates that the parallel text alignment approach to acquiring training examples can yield good results.</Paragraph>
    <Paragraph position="10"> For the remaining nouns (art, authority, channel, church, circuit, facility, grip, spade), the accuracy difference between M1 and P1 is at least 0.10. Henceforth, we shall refer to this set of 8 nouns as &amp;quot;difficult&amp;quot; nouns. We will give an analysis of the reason for the accuracy difference between M1 and P1 in the next section.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4 Analysis
4.1 Sense-Tag Accuracy of Parallel Text Training Examples
</SectionTitle>
      <Paragraph position="0"> To see why there is still a difference between the accuracy of the two approaches, we first examined the quality of the training examples obtained through parallel text alignment. If the automatically acquired training examples are noisy, then this could account for the lower P1 score.</Paragraph>
      <Paragraph position="1"> The word alignment output of GIZA++ contains much noise in general (especially for the low frequency words). However, note that in our approach, we only select the English word occurrences that align to our manually selected Chinese translations. Hence, while the complete set of word alignment output contains much noise, the subset of word occurrences chosen may still have high quality sense tags.</Paragraph>
      <Paragraph position="2"> Our manual inspection reveals that the annotation errors introduced by parallel text alignment can be attributed to the following sources: (i) Wrong sentence alignment: Due to erroneous sentence segmentation or sentence alignment, the correct Chinese word that an English word w should align to is not present in its Chinese sentence counterpart. In this case, word alignment will align the wrong Chinese word to w.</Paragraph>
      <Paragraph position="3"> (ii) Presence of multiple Chinese translation candidates: Sometimes, multiple and distinct Chinese translations appear in the aligned Chinese sentence. For example, for an English occurrence channel, both &amp;quot;Pin Dao &amp;quot; (sense 1 translation) and &amp;quot; Tu Jing &amp;quot; (sense 5 translation) happen to appear in the aligned Chinese sentence. In this case, word alignment may erroneously align the wrong Chinese translation to channel.</Paragraph>
      <Paragraph position="4"> (iii) Truly ambiguous word: Sometimes, a word is truly ambiguous in a particular context, and different translators may translate it differently. For example, in the phrase &amp;quot;the church meeting&amp;quot;, church could be the physical building sense (Jiao Tang ), or the institution sense (Jiao Hui ). In manual sense tagging done in SENSEVAL-2, it is possible to assign two sense tags to church in this case, but in the parallel text setting, a particular translator will translate it in one of the two ways (Jiao Tang or Jiao Hui ), and hence the sense tag found by parallel text alignment is only one of the two sense tags.</Paragraph>
      <Paragraph position="5"> By manually examining a subset of about 1,000 examples, we estimate that the sense-tag error rate of training examples (tagged with lumped senses) obtained by our parallel text alignment approach is less than 1%, which compares favorably with the quality of manually sense tagged corpus prepared in SENSEVAL-2 (Kilgarriff, 2001).</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Domain Dependence and Insufficient
Sense Coverage
</SectionTitle>
      <Paragraph position="0"> While it is encouraging to find out that the parallel text sense tags are of high quality, we are still left with the task of explaining the difference between M1 and P1 for the set of difficult nouns. Our further investigation reveals that the accuracy difference between M1 and P1 is due to the following two reasons: domain dependence and insufficient sense coverage.</Paragraph>
      <Paragraph position="1"> Domain Dependence The accuracy figure of M1 for each noun is obtained by training a WSD classifier on the manually sense-tagged training data (with lumped senses) provided by SENSEVAL-2 organizers, and testing on the corresponding official test data (also with lumped senses), both of which come from similar domains.</Paragraph>
      <Paragraph position="2"> In contrast, the P1 score of each noun is obtained by training the WSD classifier on a mixture of six parallel corpora, and tested on the official SENSEVAL-2 test set, and hence the training and test data come from dissimilar domains in this case.</Paragraph>
      <Paragraph position="3"> Moreover, from the &amp;quot;docsrc&amp;quot; field (which records the document id that each training or test example originates) of the official SENSEVAL-2 training and test examples, we realized that there are many cases when some of the examples from a document are used as training examples, while the rest of the examples from the same document are used as test examples. In general, such a practice results in higher test accuracy, since the test examples would look a lot closer to the training examples in this case.</Paragraph>
      <Paragraph position="4"> To address this issue, we took the official SENSEVAL-2 training and test examples of each noun w and combined them together. We then randomly split the data into a new training and a new test set such that no training and test examples come from the same document. The number of training examples in each sense in such a new training set is the same as that in the official training data set of w.</Paragraph>
      <Paragraph position="5"> A WSD classifier was then trained on this new training set, and tested on this new test set. We conducted 10 random trials, each time splitting into a different training and test set but ensuring that the number of training examples in each sense (and thus the sense distribution) follows the official training set of w. We report the average accuracy of the 10 trials. The accuracy figures for the set of difficult nouns thus obtained are listed in the column labeled M2 in Table 3.</Paragraph>
      <Paragraph position="6"> We observed that M2 is always lower in value compared to M1 for all difficult nouns. This suggests that the effect of training and test examples coming from the same document has inflated the accuracy figures of SENSEVAL-2 nouns.</Paragraph>
      <Paragraph position="7"> Next, we randomly selected 10 sets of training examples from the parallel corpora, such that the number of training examples in each sense followed the official training set of w. (When there were insufficient training examples for a sense, we just used as many as we could find from the parallel corpora.) In each trial, after training a WSD classifier on the selected parallel text examples, we tested the classifier on the same test set (from SENSEVAL-2 provided data) used in that trial that generated the M2 score. The accuracy figures thus obtained for all the difficult nouns are listed in the column labeled P2 in Table 3.</Paragraph>
      <Paragraph position="8"> Insufficient Sense Coverage We observed that there are situations when we have insufficient training examples in the parallel corpora for some of the senses of some nouns. For instance, no occurrences of sense 5 of the noun circuit (racing circuit, a racetrack for automobile races) could be found in the parallel corpora. To ensure a fairer comparison, for each of the 10-trial manually sense-tagged training data that gave rise to the accuracy figure M2 of a noun w, we extracted a new subset of 10-trial (manually sense-tagged) training data by ensuring adherence to the number of training examples found for each sense of w in the corresponding parallel text training set that gave rise to the accuracy figure P2 for w. The accuracy figures thus obtained for the difficult nouns are listed in the column labeled M3 in Table 3. M3 thus gave the accuracy of training on manually sense-tagged data but restricted to the number of training examples found in each sense from parallel corpora.</Paragraph>
      <Paragraph position="9">  The difference between the accuracy figures of M2 and P2 averaged over the set of all difficult nouns is 0.140. This is smaller than the difference of 0.189 between the accuracy figures of M1 and P1 averaged over the set of all difficult nouns. This confirms our hypothesis that eliminating the possibility that training and test examples come from the same document would result in a fairer comparison. null In addition, the difference between the accuracy figures of M3 and P2 averaged over the set of all difficult nouns is 0.065. That is, eliminating the advantage that manually sense-tagged data have in their sense coverage would reduce the performance gap between the two approaches from 0.140 to 0.065. Notice that this reduction is particularly significant for the noun circuit. For this noun, the parallel corpora do not have enough training examples for sense 4 and sense 5 of circuit, and these two senses constitute approximately 23% in each of the 10-trial test set.</Paragraph>
      <Paragraph position="10"> We believe that the remaining difference of 0.065 between the two approaches could be attributed to the fact that the training and test examples of the manually sense-tagged corpus, while not coming from the same document, are however still drawn from the same general domain. To illustrate, we consider the noun channel where the difference between M3 and P2 is the largest. For channel, it turns out that a substantial number of the training and test examples contain the collocation &amp;quot;Channel tunnel&amp;quot; or &amp;quot;Channel Tunnel&amp;quot;. On average, about 9.8 training examples and 6.2 test examples contain this collocation. This alone would have accounted for 0.088 of the accuracy difference between the two approaches.</Paragraph>
      <Paragraph position="11"> That domain dependence is an important issue affecting the performance of WSD programs has been pointed out by (Escudero et al., 2000). Our work confirms the importance of domain dependence in WSD.</Paragraph>
      <Paragraph position="12"> As to the problem of insufficient sense coverage, with the steady increase and availability of parallel corpora, we believe that getting sufficient sense coverage from larger parallel corpora should not be a problem in the near future for most of the commonly occurring words in a language.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Related Work
</SectionTitle>
      <Paragraph position="0"> Brown et al. (1991) is the first to have explored statistical methods in word sense disambiguation in the context of machine translation. However, they only looked at assigning at most two senses to a word, and their method only asked a single question about a single word of context. Li and Li (2002) investigated a bilingual bootstrapping technique, which differs from the method we implemented here. Their method also does not require a parallel corpus.</Paragraph>
      <Paragraph position="1"> The research of (Chugur et al., 2002) dealt with sense distinctions across multiple languages. Ide et al. (2002) investigated word sense distinctions using parallel corpora. Resnik and Yarowsky (2000) considered word sense disambiguation using multiple languages. Our present work can be similarly extended beyond bilingual corpora to multilingual corpora.</Paragraph>
      <Paragraph position="2"> The research most similar to ours is the work of Diab and Resnik (2002). However, they used machine translated parallel corpus instead of human translated parallel corpus. In addition, they used an unsupervised method of noun group disambiguation, and evaluated on the English all-words task. Conclusion In this paper, we reported an empirical study to evaluate an approach of automatically acquiring sense-tagged training data from English-Chinese parallel corpora, which were then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task. Our investigation reveals that this method of acquiring sense-tagged data is promising and provides an alternative to manual sense tagging.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>