<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1108">
  <Title>Combining Neural Networks and Statistics for Chinese Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Categorization, Text Summarization, Speech
</SectionTitle>
    <Paragraph position="0"> Recognition, Text to Speech, and so on.</Paragraph>
    <Paragraph position="1"> With the rise of corpus linguistics, machine learning methods based on statistics are booming (Yarowsky, 1992). These methods draw support from high-powered computers to gather statistics over large real-world corpora and to find and acquire linguistic knowledge automatically. Because the statistical machinery itself stays fixed while the data change, it is easy to track the evolution and development of natural language. So statistical methods have attracted the attention of NLP researchers and have gradually become the mainstream. Corpus-based statistical approaches include Decision Trees (Pedersen, 2001), Decision Lists, Genetic Algorithms, the Naive Bayesian Classifier (Escudero, 2000), the Maximum Entropy Model (Adam, 1996; Li, 1999), and so on.</Paragraph>
    <Paragraph position="2"> Corpus-based statistical approaches can be divided into supervised and unsupervised according to whether the training corpus is sense-labeled text. Supervised learning methods have good learning ability and can reach better accuracy in WSD experiments (Schutze, 1998). However, the data sparseness problem is a bottleneck for supervised learning algorithms. To obtain a better learning and disambiguating effect, one can enlarge the training corpus or smooth the data. In practice, enlarging the training corpus costs considerable time and manpower, and smoothing is merely a subsidiary measure. A sufficiently large training corpus remains the foundation of a satisfactory WSD result.</Paragraph>
    <Paragraph position="3"> Unsupervised WSD does not depend on a tagged corpus and can be trained on large real corpora from all kinds of application fields, so researchers have begun to pay attention to this kind of method (Lu, 2002). Such methods can overcome the sparseness problem to a degree.</Paragraph>
    <Paragraph position="4"> It is obvious that the two kinds of statistics-based methods have their own advantages and disadvantages, and cannot supersede each other.</Paragraph>
    <Paragraph position="5"> This paper studies Chinese WSD using an artificial neural network model and investigates how the input model constructed from context words, and the size of the training corpus, affect WSD.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="31" type="metho">
    <SectionTitle>
2 BP Neural Network
</SectionTitle>
    <Paragraph position="0"> At the moment, there are more than 30 kinds of artificial neural networks (ANN) in research and application. Among them, the BP neural network is the most popular ANN model nowadays.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The structure of BP Neural Network
</SectionTitle>
      <Paragraph position="0"> The BP model provides a simple method to calculate the variation of network performance caused by the variation of a single weight. The model contains not only input and output nodes but also one or more hidden layers. Fig. 1.1 shows the structure of a triple-layer BP neural network. Because it includes a process that modifies the weights from the output layer back to the input layer according to the total error, the BP neural network is also called an Error Back Propagation network.</Paragraph>
      <Paragraph position="1"> Fig. 1.1 The structure of a BP neural network. Except for the nodes of the input layer, the nodes of all other layers have non-linear inputs and outputs, so the activation function must be differentiable everywhere. Generally speaking, we can choose the sigmoid, hyperbolic tangent, or linear function as the activation function because they are convenient for gradient-based search.</Paragraph>
      <Paragraph position="3"> The output of sigmoid function ranges between 0 and 1, increasing monotonically with its input.</Paragraph>
      <Paragraph position="4"> Because it maps a very large input domain to a small range of outputs, it is often referred to as the squashing function of the unit. The output layer and hidden layer should adopt the sigmoid activation function when the output must be constrained, such as confining it between 0 and 1.</Paragraph>
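As a minimal illustration of the squashing behaviour described above (a Python sketch, not part of the original model):

```python
import math

def sigmoid(x):
    """Logistic squashing function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Monotonically increasing; a very large input domain is squashed into (0, 1).
assert sigmoid(0) == 0.5
assert sigmoid(-10) < 0.001          # close to 0
assert sigmoid(10) > 0.999           # close to 1
assert sigmoid(-1) < sigmoid(0) < sigmoid(1)
```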
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Back propagation function of the BP neural network
</SectionTitle>
      <Paragraph position="0"> The joint weights are revised many times as the error propagates back through the BP network. Each variation of the joint weights is obtained by gradient descent. Because the hidden layer has no target output, the variation of its joint weights is derived from the error back-propagated from the output layer. If there are several hidden layers, the same reasoning is applied layer by layer back to the first.</Paragraph>
      <Paragraph position="1"> 1) The variation of joint weights in the output layer. The variation of the joint weight from node i to output node k is calculated as follows:</Paragraph>
      <Paragraph position="3"> 2) The variation of joint weights in the hidden layer. The variation of the joint weight from input node j to hidden node i is calculated as follows:</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> 3. The construction of the WSD model
Since the input and output of a neural network accept only numerical data, the prerequisite for applying a BP network to WSD is to vectorize the semantic material (words or phrases) and the senses. To train the BP model, the input vector P and the objective vector O of WSD must be determined first. Then the network structure must be designed: how many layers, how many neural nodes in every layer, and the activation functions of the hidden and output layers. Training also needs the weighted inputs, the outputs, and the error vector. Training stops when the sum of squared errors falls below the error objective; otherwise the output errors are propagated back to adjust the joint weights and the training is repeated.</Paragraph>
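The loop just described (forward pass, squared-error check, back-propagated weight adjustment by gradient descent) follows the standard delta rule for a sigmoid network. A minimal sketch — the layer sizes, learning rate, and data below are illustrative assumptions, not the paper's configuration:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_step(x, t, W1, W2, lr=0.5):
    """One forward pass plus one back-propagation step (standard delta rule).

    x: input vector, t: target vector, W1/W2: hidden/output weight matrices
    (lists of rows).  Returns the sum of squared errors before the update.
    """
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W1]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    # Output-layer deltas: (target - output) * sigmoid'(net), sigmoid' = o*(1-o).
    d_o = [(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, o)]
    # Hidden-layer deltas: propagate the output deltas back through W2.
    d_h = [hi * (1 - hi) * sum(d_o[k] * W2[k][i] for k in range(len(d_o)))
           for i, hi in enumerate(h)]
    for k in range(len(W2)):          # adjust output-layer weights
        for i in range(len(h)):
            W2[k][i] += lr * d_o[k] * h[i]
    for i in range(len(W1)):          # adjust hidden-layer weights
        for j in range(len(x)):
            W1[i][j] += lr * d_h[i] * x[j]
    return sum((tk - ok) ** 2 for tk, ok in zip(t, o))

random.seed(0)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
x, t = [1.0, 0.0, 1.0], [1.0, 0.0]
errs = [bp_step(x, t, W1, W2) for _ in range(300)]
assert errs[-1] < errs[0]   # the squared error shrinks as training repeats
```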
    </Section>
    <Section position="3" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.1 To vectorize the vocabulary
</SectionTitle>
      <Paragraph position="0"> WSD depends on the context to judge the meaning of an ambiguous word, so the input of the model should be the ambiguous word and the context words around it. To vectorize the words in the context, the Mutual Information (MI) of the ambiguous word and each context word is calculated; MI reflects the relative distance between the ambiguous word and a context word, and its value can stand in for that context word, which makes it suitable as the input of the model. The MI function is as follows: MI(w, c) = log2 [ p(w, c) / (p(w) p(c)) ], where p(w, c) is the probability of the ambiguous word w and the context word c appearing together.</Paragraph>
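The MI computation reduces to a few corpus counts. A sketch assuming simple frequency estimates of the probabilities (the counts below are invented for illustration):

```python
import math

def mutual_information(n_wc, n_w, n_c, n_total):
    """MI(w, c) = log2( p(w, c) / (p(w) * p(c)) ) from raw corpus counts.

    n_wc: co-occurrence count of w and c; n_w, n_c: individual counts;
    n_total: corpus size.  Probabilities are simple relative frequencies.
    """
    p_wc = n_wc / n_total
    p_w = n_w / n_total
    p_c = n_c / n_total
    return math.log2(p_wc / (p_w * p_c))

# Toy counts: w and c co-occur far more often than chance, so MI is positive.
mi = mutual_information(n_wc=50, n_w=200, n_c=500, n_total=1_000_000)
assert mi > 0
```

A positive value means the context word co-occurs with the ambiguous word more often than chance, which is exactly the signal the input vector feeds to the network.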
      <Paragraph position="1"> The experimental corpus in this article comes from the People's Daily of 1998. Its extent is 123,882 lines (about 10,000,000 words), covering 121,400 distinct words and phrases.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="31" type="sub_section">
      <SectionTitle>
3.2 The pretreatment of BP network model
</SectionTitle>
      <Paragraph position="0"> Supervised WSD needs manual sense tagging, but manual tagging is time-consuming, so it is difficult to obtain a large, high-quality training corpus. To overcome this difficulty and get a large enough experimental corpus, we must seek a new way.</Paragraph>
      <Paragraph position="1"> We use pseudowords in place of real words, which yields an arbitrarily large experimental corpus according to the real demand.</Paragraph>
      <Paragraph position="2"> A pseudoword is an artificial combination of several real words, formed on the basis of experimental demand: an unreal word that possesses many features of real words and serves instead of a real word as the experimental object in natural language research.</Paragraph>
      <Paragraph position="3"> In the real world, a word acquires many meanings through the variation and flexible application of words, a long-term natural evolution that never ceases. For example, the word ' '(da3) has extended to some new uses in recent years. Indeed, over the long history of human language, the development and variation of word meanings can be more rapid than the replacement of dictionaries, which often makes it awkward to define word meanings with a dictionary and is inconvenient for dictionary-based research in natural language.</Paragraph>
      <Paragraph position="4"> But the meaning of a pseudoword (Schutze, 1992) need not be defined with the aid of a dictionary, and a pseudoword simulates a real ambiguous word for surveying the effect of various sense-classification algorithms. To form a pseudoword, single-meaning words are needed as morphemes.</Paragraph>
      <Paragraph position="5"> Each morpheme of a pseudoword has a single meaning, and every occurrence of a morpheme in the corpus is equivalent to a sense-tagged instance of the pseudoword. That is similar to the effect of manual sense tagging, but more stable and reliable. What's more, the corpus can be enlarged endlessly according to demand, avoiding the phenomenon of sparse data.</Paragraph>
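The pseudoword construction described above can be sketched as follows; the sentences and morphemes are hypothetical examples, and the gold "sense" label is simply the original single-sense morpheme, so no manual tagging is needed:

```python
def make_pseudoword_corpus(sentences, morphemes, pseudoword="W*"):
    """Replace each occurrence of a single-sense morpheme with the pseudoword,
    recording the original morpheme as the gold sense label."""
    labeled = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in morphemes:
                ctx = tokens[:i] + [pseudoword] + tokens[i + 1:]
                labeled.append((ctx, tok))   # (sentence, gold sense)
    return labeled

# Two single-sense words form one two-sense pseudoword "W*".
sents = [["open", "the", "door"], ["eat", "a", "banana"]]
data = make_pseudoword_corpus(sents, {"door", "banana"})
assert data == [(["open", "the", "W*"], "door"),
                (["eat", "a", "W*"], "banana")]
```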
      <Paragraph position="6"> To define the number of morphemes per pseudoword, we count the average number of meanings according to large-sized Chinese dictionaries (Table 3.1). Table 3.2 shows the overall number of ambiguous words and the percentage of ambiguous words having 2~4 meanings among all ambiguous words. These two charts indicate that the verb is the most active word class in Chinese, with the highest average number of meanings, about 2.56.</Paragraph>
      <Paragraph position="7"> Ambiguous words having 2~4 meanings account for the largest share of all ambiguous words.</Paragraph>
      <Paragraph position="8"> The sense of an ambiguous word should be determined from its context, so the model's input should be the vectors of the ambiguous word and its context words. (Table 3.2 the distribution of ambiguous words.) It is well known that the number of context words on the two sides of an ambiguous word is not fixed across sentences, but the number of input vectors needed by a BP network is fixed: the number of neural nodes of the input layer is fixed during training. If the feature-vector extraction method in context is (-M, +N), that is, M vectors on the left of the ambiguous word and N vectors on the right, the extraction of feature vectors must span sentence limits. If the number of feature vectors is still not enough, the ambiguous words at the left and right boundaries of the whole corpus do not participate in the training.</Paragraph>
      <Paragraph position="9"> According to the (-M, +N) extraction method, the input vector of the model is V_input = {MI_1, MI_2, ..., MI_(M+N)}. (Table 3.3 the total number of feature-vector samples of the ambiguous words; total 4,998.) The training corpus is 105,000 lines, each line a paragraph, totally about 10,000,000 words. Table 3.3 shows the number of collected feature-vector samples (the frequency of each ambiguous word).</Paragraph>
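A sketch of the (-M, +N) extraction described above, including the rule of skipping ambiguous words that sit too close to the corpus boundary to yield a full window (the token values are illustrative):

```python
def extract_window(tokens, idx, m=5, n=5):
    """(-M, +N) feature extraction: the M words left and N words right of the
    ambiguous word at position idx.  The window may span sentence limits;
    occurrences too close to the corpus boundary are skipped (None), matching
    the treatment of boundary words described in the text."""
    if idx - m < 0 or idx + n >= len(tokens):
        return None   # boundary occurrence: not enough context, skip
    return tokens[idx - m:idx] + tokens[idx + 1:idx + 1 + n]

toks = ["a", "b", "c", "W", "d", "e", "f"]
assert extract_window(toks, 3, m=3, n=3) == ["a", "b", "c", "d", "e", "f"]
assert extract_window(toks, 1, m=3, n=3) is None  # too close to the left edge
```

Each returned feature word would then be replaced by its MI values to form the fixed-length numeric input the BP network requires.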
    </Section>
    <Section position="5" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.3 The definition of output model
</SectionTitle>
      <Paragraph position="0"> Every ambiguous word has three meanings, so there are eighteen meanings in total for the six ambiguous words. Every ambiguous word trains its own model, and every model has three outputs shown as a three-bit binary code; for example, the three meanings of an ambiguous word W are shown as follows:</Paragraph>
      <Paragraph position="2"/>
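The concrete encoding table was lost in extraction; a plausible reading of "three outputs shown as a three-bit binary code" is one bit per sense. The table and decoder below are therefore an assumption, not the paper's exact codes:

```python
# Hypothetical one-bit-per-sense encoding for the three senses of W.
SENSE_CODES = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}

def decode(outputs):
    """Map the three network outputs back to a sense number: strongest node wins."""
    return max(range(len(outputs)), key=lambda k: outputs[k]) + 1

assert decode([0.9, 0.2, 0.1]) == 1
assert decode([0.1, 0.3, 0.8]) == 3
```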
    </Section>
    <Section position="6" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.4 The definition of network structure
</SectionTitle>
      <Paragraph position="0"> According to statistics, when (-M, +N) is (-8, +9), the feature extraction method covers more than 87% of the effective information (Lu, 2001). However, if the sentence is very short, collecting contextual feature words on the basis of (-8, +9) brings much useless information into the input model. Undoubtedly, that increases the noise and reduces the sense-discrimination ability of the neural network.</Paragraph>
      <Paragraph position="1"> This article surveys the experimental corpus and finds that a fairly integrated meaning unit (bounded by marks such as the comma, semicolon, ellipsis, period, question mark, and exclamation mark) has an average length of 9~10 words. So this article collects contextual feature words on the basis of (-5, +5): 10 feature words, each of whose MI with each of the three meanings of the ambiguous word is calculated separately, giving 30 input values. All punctuation marks are filtered out while the feature words are collected. The input layer of the neural network model therefore has 30 neural nodes. The triple-layer neural network adopts the sigmoid activation function. On the basis of experimental comparison, the number of neural nodes in the hidden layer is set to 12, with 3 neural nodes in the output layer. Hence the structure of the model is 30 x 12 x 3, and the training error tolerance is set to 0.3, also based on experimental comparison.</Paragraph>
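The 30 x 12 x 3 structure with sigmoid activations can be sketched as a plain forward pass; random weights and a random input stand in for trained values here:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, W2):
    """Forward pass of the triple-layer 30 x 12 x 3 model: 30 MI inputs
    (10 feature words x 3 senses), 12 sigmoid hidden nodes, 3 sense outputs."""
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W1]
    return [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in W2]

random.seed(1)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(30)] for _ in range(12)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(12)] for _ in range(3)]
x = [random.random() for _ in range(30)]
out = forward(x, W1, W2)
assert len(out) == 3                        # one output node per sense
assert all(0.0 < o < 1.0 for o in out)      # sigmoid outputs stay in (0, 1)
```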
    </Section>
    <Section position="7" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.5 The test and training of model
</SectionTitle>
      <Paragraph position="0"> The experimental corpus described above is 123,882 lines. It is divided into three parts according to the demands of the experiment: C1 (15,000 lines), C2 (60,000 lines), and C3 (105,000 lines). The open-test corpus is 18,882 lines.</Paragraph>
      <Paragraph position="1"> Table 3.3 shows that there is a great disparity between the sample numbers of different ambiguous words within the same class of experimental corpus, and that the distribution of the different meanings of the same ambiguous word is uneven.</Paragraph>
      <Paragraph position="2"> For the trained neural network to discriminate each sense well, the number of training samples should be about equal for each meaning. So this experiment selects samples according to the least-frequent meaning. For example, if the training corpus holds 200 samples of the first meaning, 400 of the second, and 500 of the third, then to balance the input only 200 samples of each meaning are selected for training.</Paragraph>
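The balancing rule above (down-sample every sense to the size of the least-frequent one) can be sketched as follows, using the 200/400/500 example from the text:

```python
import random

def balance_samples(samples_by_sense, seed=0):
    """Down-sample every sense to the size of the least-frequent sense, so the
    network sees roughly equal numbers of training samples per meaning."""
    rng = random.Random(seed)
    floor = min(len(s) for s in samples_by_sense.values())
    return {sense: rng.sample(s, floor) for sense, s in samples_by_sense.items()}

# 200 / 400 / 500 samples -> 200 samples per sense after balancing.
corpus = {1: list(range(200)), 2: list(range(400)), 3: list(range(500))}
balanced = balance_samples(corpus)
assert all(len(v) == 200 for v in balanced.values())
```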
      <Paragraph position="3"> The three groups of training corpora train 3 neural networks with different weight vectors for every ambiguous word, and the close and open tests are made for these networks separately.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="31" end_page="31" type="metho">
    <SectionTitle>
4 The result of experiment
</SectionTitle>
    <Paragraph position="0"> In order to analyze how the extent of the training corpus influences the sense-discrimination ability of the neural network, this article trains the neural network model on the experimental corpora C1, C2, and C3, and makes the close and open tests for the 6 ambiguous words separately. The close test means that the same corpus is used for both training and testing.</Paragraph>
    <Paragraph position="3"> The experiment is divided into two groups according to the extracting method of contextual feature words.</Paragraph>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.1 The first experiment
</SectionTitle>
      <Paragraph position="0"> Table 4.1 shows the result of the first experiment, which extracts the contextual feature words using the (-5, +5) method.</Paragraph>
      <Paragraph position="1"> In addition, the first experiment investigates how the extent of the training corpus (the number of training samples) influences the discrimination ability of the models. (Table 4.1 the results for the six ambiguous words.) The results of the tests for the six ambiguous words are shown in table 4.2 (close test), table 4.3 (open test), and table 4.4. Considering the length of this article, tables 4.2 and 4.3 show the detailed data, and table 4.4 is brief.</Paragraph>
      <Paragraph position="2"> (Table 4.2 the correct percentage in the close test under the different training corpora.)</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.2 The second experiment
</SectionTitle>
      <Paragraph position="0"> The second experiment investigates emphatically how the method of collecting the feature words influences the discrimination ability of the BP model.</Paragraph>
      <Paragraph position="1"> (Table 4.3 the correct percentage in the open test under the different training corpora.) Many extraction methods are adopted in this experiment, including (-10, +10), (-3, +3), (-3, +7), (-7, +3), (-4, +6) and (-6, +4). Merely the ambiguous words W are taken as the experimental objects under each feature-collecting method in this group of experiments. See table 4.5 for the correct percentage of WSD.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="31" end_page="31" type="metho">
    <SectionTitle>
5. Analysis and discussion
</SectionTitle>
    <Paragraph position="0"> See table 5.1 for the number of experimental corpus samples in experiment.</Paragraph>
    <Paragraph position="1"> According to tables 3.3 and 5.1, the frequency of each meaning (morpheme) of an ambiguous word appearing in the corpus differs widely, which accords with the real distribution of the meanings of ambiguous words. One point differs, however: here the frequency of each meaning of the ambiguous word is rather high (an outcome of the morpheme selection). In other words, many examples appear for each meaning of the ambiguous word in the training and test corpora. By contrast, the frequency differences among the meanings of a real ambiguous word are quite obvious, because some meanings are used mainly in oral language and never or seldom appear in the experimental corpus.</Paragraph>
    <Paragraph position="2"> Statistics uncover this linguistic phenomenon: we find that the most frequent meaning of a real ambiguous word can account for up to 83.54% of all occurrences of its meanings in the corpus, which illustrates that the distribution of meanings of a real ambiguous word has a great disparity. Given that condition, disambiguating our pseudowords is decidedly harder than disambiguating real ambiguous words.</Paragraph>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
5.1 The analysis and discussion of the first
experiment
</SectionTitle>
      <Paragraph position="0"> Table 4.1 records in detail the results of the close and open tests and the training materials used to obtain them.</Paragraph>
      <Paragraph position="1"> Seen from the experimental results, the highest correct percentage reaches 89.51% (ambiguous word  ).</Paragraph>
      <Paragraph position="2"> The relationship between the correct percentage and the extent of the training corpus can be deduced from the experimental results of tables 4.2, 4.3 and 4.4. The larger the training corpus (the number of training samples), the higher the result of the close test; this is obvious from C1 to C2, while from C2 to C3 one or two experimental results fluctuate slightly.</Paragraph>
      <Paragraph position="3"> With the growth of the training samples, the experimental results of the open test increase steadily, except for the ambiguous word W2 (a slight difference). The experimental data prove that the growth of training samples raises the correct percentage. However, once the growth reaches a certain degree, further growth no longer improves the model: the effect of noise becomes more and more remarkable, which decreases the model's discrimination ability to a certain degree. On the other hand, as the training corpus grows, the linguistic phenomena around the ambiguous words become richer and more complex, which makes the meaning harder to determine.</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
5.2 The analysis and discussion of the second experiment
</SectionTitle>
      <Paragraph position="0"> This article emphasizes in experiment two the collecting method of contextual feature words, in other words, how different values of M and N influence the BP network model. The experimental results (tables 4.1 and 4.5) tell us that the context window influences the correct percentage heavily. The correct percentage increases almost by leaps and bounds from (-3, +3) to (-5, +5); the discrepancy is obvious in both the close test and the open test. The correct percentage increases again at (-10, +10), where the close test of the ambiguous word W6 is more than 90% and its open test reaches 89.62%, with the exception of W1, whose open test is slightly special. That illustrates that the more widely the context window opens, the more effective information is caught, benefiting WSD more.</Paragraph>
      <Paragraph position="1"> Comparing the four collection methods (-3, +7), (-7, +3), (-4, +6) and (-6, +4) with (-5, +5): although the window sizes are the same, the numbers of feature words on either side of the ambiguous word differ, and so do the experimental results (tables 4.1 and 4.5). Among them, the correct percentage of (-5, +5) is the highest, and those of (-4, +6) and (-6, +4) are a bit better than those of (-3, +7) and (-7, +3). That shows that the more balanced the feature words on the two sides of the ambiguous word, the more advantageous for judging the meaning, and the better the experimental results. In addition, some experimental results of the open test are better than those of the close test. The main reason is that the open-test corpus is smaller than the training corpus, so the contextual meanings of the ambiguous words in it are rather explicit, which explains this kind of experimental result.</Paragraph>
    </Section>
    <Section position="3" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
5.3 Conclusions
</SectionTitle>
      <Paragraph position="0"> Considering the analysis of the experimental data, the conclusions are as follows. First, the artificial neural network model established in this article has a good ability to discriminate Chinese word senses.</Paragraph>
      <Paragraph position="1"> Next, a higher correct percentage of WSD stems from a large enough corpus.</Paragraph>
      <Paragraph position="2"> Finally, the larger the window of contextual feature words, the more effective information is available; at the same time, the more balanced the number of feature words on the two sides of the ambiguous word, the more beneficial for WSD.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="31" end_page="31" type="metho">
    <SectionTitle>
6 Concluding remarks
</SectionTitle>
    <Paragraph position="0"> Although the BP network is an extensively applied classification model, reports of research applying it to WSD are seldom seen. Reports on Chinese WSD are even fewer, and only one (Zhang, 2001) is available among domestic reports.</Paragraph>
    <Paragraph position="1"> Zhang (2001) uses 96 semantic classes to replace all the words in the training corpus according to the TongyiciCilin. The input model is the codes of the semantic classes of the context words and the ambiguous words.</Paragraph>
    <Paragraph position="2"> The WSD experiment in that document covers merely one word, ' '(cai2liao4), and the correct percentage of its open test is 80.4%. ' ' has 3 meanings, which is similar to the ambiguous words constructed in this article.</Paragraph>
    <Paragraph position="3"> Using a BP network for Chinese WSD, the key point and the difficulty lie in the determination of the input model. The performance of the input model directly influences the construction of the BP network and the output results.</Paragraph>
    <Paragraph position="4"> We experimented on the input of the BP network many times and finally found the input model introduced above (table 3.1), whose test results are satisfactory.</Paragraph>
  </Section>
</Paper>