<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1145">
  <Title>Morpheme-based Derivation of Bipolar Semantic Orientation of Chinese Words</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. SO from Association-PMI
</SectionTitle>
    <Paragraph position="0"> Turney (2003) examined SO-PMI (Pointwise Mutual Information) and SO-LSA (Latent Semantic Analysis). SO-PMI will be our focus in the following parts. PMI is defined as: PMI(word1, word2)=log2( )()( )&amp;(  wordpwordp wordwordp ) where p(word1 &amp; word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 &amp; word2) and p(word1) p(word2) is a measure of the degree of statistical dependence between the words. The SO of a given word is calculated from the strength of its association with a set of positive words, minus the strength of its association with a set of negative words. Thus the SO of a word, word, is calculated by SO-PMI as follows:</Paragraph>
    <Paragraph position="2"> where Pwords is a set of 7 positive paradigm words (good, nice, excellent, positive, fortunate, correct, and superior) and Nwords is a set of 7 negative paradigm words (bad, nasty, poor, negative, unfortunate, wrong, and inferior). Those 14 words were chosen by intuition and based on opposing pairs (good/bad, excellent/poor, etc.).</Paragraph>
    <Paragraph position="3"> The words are rather insensitive to context, i.e., 'excellent' is positive in almost all contexts.</Paragraph>
    <Paragraph position="4"> A word, word, is classified as having a positive SO when SO-PMI(word) is positive and a negative SO when SO-PMI(word) is negative.</Paragraph>
    <Paragraph position="5"> Turney (2003) used the Alta Vista Advanced search engine with a NEAR operator, which constrains the search to documents that contain the words within ten words of one another, in either order. Three corpora were tested. AV-ENG is the largest corpus covering 350 million web pages (English only) indexed by Alta Vista. The medium corpus is a 2% subset of AV-ENG corpus called AV-CA (Canadian domain only). The smallest corpus TASA is about 0.5% of AV-CA and contains various short documents.</Paragraph>
    <Paragraph position="6"> One of the lexicons used in Turney's experiment is the GI lexicon (Stone et al., 1966), which consists of 3,596 adjectives, adverbs, nouns, and verbs, 1,614 positive and 1,982 negative.</Paragraph>
    <Paragraph position="7"> Table 1 shows the precision of SO-PMI with the GI lexicon in the three corpora.</Paragraph>
    <Paragraph position="8">  The strength (absolute value) of the SO was used as a measure of confidence that the words will be correctly classified. Test set words were sorted in descending order of the absolute value of their SO and the top ranked words (the highest confidence words) were then classified. For example, the second row (starting with 75%) in table 1 shows the precision when the top 75% were classified and the last 25% (with lowest confidence) were ignored. We will employ this measure of confidence in the following experiments.</Paragraph>
    <Paragraph position="9"> Turney concluded that SO-PMI requires a large corpus (hundred billion words), but it is simple, easy to implement, unsupervised, and it is not restricted to adjectives.</Paragraph>
    <Paragraph position="10"> 4. Experiment with Chinese Words In the following experiments, we applied Turney's method to Chinese. The algorithm was run with 20 and then 40 paradigm words for comparison. The experiment details include: NEAR Operator: it was applied to constrain the search to documents that contain the words within ten words of one another, in either order. Corpus: the LIVAC synchronous corpus (Tsou et al., 2000, http://www.livac.org) was used. It covers 9-year news reports of Chinese communities including Hong Kong, Beijing and Taiwan, and we used a sub-corpus with about 34 million word tokens and 410k word types.</Paragraph>
    <Paragraph position="11"> Test Set Words: a combined set of two dictionaries of polarized words (Guo, 1999, Wang, 2001) was used to evaluate the results. While LIVAC is an enormous Chinese corpus, its size is still far from the hundred-billion-word corpus used by Turney. It is likely that some words in the combined set are not used in the 9-year corpus. To avoid a skewed recall, the number of test set words used in the corpus is given in table 2. In other words, the recall can be calculated by the total number of words used in the corpus, but not by that recorded in the dictionaries. The difference between two numbers is just 100.</Paragraph>
    <Paragraph position="12"> Polarity Total no. of the test set words  chosen using intuition and based on opposing pairs, as Turney (2003) did. The first experiment was conducted with 10 positive and 10 negative paradigm words, as follows, Pwords: L (honest), # (clever), O (sufficient), (lucky), % (right), -- (excellent), (prosperous), (kind), , (brave), P (humble) Nwords: (hypocritical), a (foolish), &amp;quot; R (deficient), l (unlucky), Q (wrong), m (adverse), (unsuccessful), v J (violent), &lt;&lt; (cowardly), &lt; (arrogant) The experiment was then repeated by increasing the number of paradigm words to 40. The paradigm words added are: Pwords: 5 I (mild), P (favourable), &lt;&lt;  (successful), % (positive), L (active), W (optimistic), / (benign), L (attentive), ; l (promising), e (incorrupt) Nwords: ? ^ (radical), l (unfavourable), (failed), (negative), a (passive), (pessimistic), / (malignant), (inattentive), (indifferent), M (corrupt)</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Results
</SectionTitle>
      <Paragraph position="0"> Tables 3 and 4 show the precision and recall of SO-PMI by two sets of paradigm words.</Paragraph>
      <Paragraph position="1">  The results of both sets gave a satisfactory precision of 80% even in 100% confidence. However, the recall was just 45.56% under the 20-word condition, and rose to 59.57% under the 40-word condition. The 15% rise was noted.</Paragraph>
      <Paragraph position="2"> To further improve the recall performance, we experimented with a modified algorithm based on the distinct features of Chinese morphemes.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Experiment with Chinese Morphemes
</SectionTitle>
    <Paragraph position="0"> Taking morphemes to be smallest linguistic meaningful unit, Chinese morphemes are mostly monosyllabic and single characters, although there are some exceptional poly-syllabic morphemes like *`* (grape), (coffee), which are mostly loanwords. In the following discussion, we consider morphemes to be monosyllabic and represented by single characters.</Paragraph>
    <Paragraph position="1"> It is observed that many poly-syllabic words with the same SO incorporate a common set of morphemes. The fact suggests the possibility of using paradigm morphemes instead of words.</Paragraph>
    <Paragraph position="2"> Unlike English, the constituent morphemes of a Chinese word are often free-standing monosyllabic words. It is note-worthy that words in ancient Chinese were much more mono-morphemic than modern Chinese. The evolution from monosyllabic word to disyllabic word may have its origin in the phonological simplification which has given rise to homophony, and which has affected the efficacy of communication. To compensate for this, many more related disyllabic words have appeared in modern Chinese (Tsou, 1976). There are three basic constructions for deriving disyllabic words in  Chinese, including: (1) combination of synonyms or near synonyms ( 5 f, warm, genial, 5=warm, mild, f =warm, genial) (2) combination of semantically related morphemes ( P -, P=affair, -=circumstances) (3) The affixation of minor suffixes which serve no primary grammatical function ( 6,  =village, 6=zi, suffix) The three processes for deriving disyllabic morphemes in Chinese outlined here should be viewed as historical processes. The extent to which such processes may be realized by native speakers to be productive synchronically bears further exploration. Of the three processes, the first two, i.e., synonym and near-synonym compounding, are used frequently by speakers for purposes of disambiguation. In view of this development, the evolution from monosyllabic words in ancient Chinese to disyllabic words in modern Chinese does not change the inherent meaning of the morphemes (words in ancient Chinese) in many cases. The SO of a word often conforms to that of its morphemes.</Paragraph>
    <Paragraph position="3"> In English, there are affixal morphemes like dis-, un- (negation prefix), or -less (suffix meaning short-age), -ful (suffix meaning 'to have a property of'), we can say 'careful' or 'careless' to expand the meaning of 'care'. However, it is impossible to construct a word like '*ful-care', '*less-care'. However, in Chinese, the position of a morpheme in many disyllabic words is far more flexible in the formation of synonym and near-synonym compound words. For instance, ' b'(honor) is a part of two similar word ' b(&amp;' (honor-bright) and ' v b'(outstanding-honor). Morphemes in Chinese are like a 'zipped file' of the same file types. When it unzips, all the words released have the same SO.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Probability of Constituent Morphemes of Words with the Same SO
</SectionTitle>
      <Paragraph position="0"> of Words with the Same SO Most morphemes can contribute to positive or negative words, regardless of their inherent meaning. For example, ' ' (luck) has inherently a positive meaning, but it can construct both positive word ' 1 ' (lucky) or a negative word ' ' (unlucky). Thus it is not easy to define the paradigm set simply by intuition. But we can assign a probability value for a morpheme in forming polarized words on the basis of corpus data.</Paragraph>
      <Paragraph position="1"> The first step is to come up with possible paradigm morphemes by intuition in a large set of polarized words. With the LIVAC synchronous corpus, the types and tokens of the words constructed by the selected morphemes can easily be extracted. The word types, excluding proper nouns, are then manually-labeled as negative, neutral or positive. Then to obtain the probability that a polar morpheme generates words with the same SO, the tokens of the polarized word types carrying the morpheme are divided by the tokens of all word types carrying the morpheme. For example, given a negative morpheme, m1, the probability that it appears in negative words in token, P(m1, -ve) is given by:  Positive morphemes can be done likewise. Ten negative morphemes and ten positive morphemes were chosen as in table 5. Their values of P(morpheme, orientation) are all above 0.95.</Paragraph>
      <Paragraph position="2">  Those morphemes were extracted from a 5-year subset of the LIVAC corpus. A morpheme, free to construct new words, may construct hundreds of words but those words with extremely low frequency can be regarded as 'noise'. The 'noise' may be 'creative use' or even incorrect use. Thus, the number of ready-to-label word types formed from a particular morpheme was limited to 50, but it must cover 80% of the tokens of all word types carrying the morpheme in the corpus (i.e., 80% dominance). For example, if the morpheme m1 constructs 120 word types with 10,000 tokens, and the first 50 high-frequency words can reach 8,000 tokens, then the remaining 70 low-frequency word types, or noise, are discarded. Otherwise, the number of sampled words would be expanded to a number (over 50) fulfilling 80% dominance.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results and Evaluation
</SectionTitle>
      <Paragraph position="0"> In table 6, the precision of 80.23% is slightly better than 79.96% of the 20-word condition, and just 1% lower than that of the 40-word condition. However, the recall drastically increases from 45.56%, or 59.57% under the 40-word condition, to 85.03%.</Paragraph>
      <Paragraph position="1"> In other words, the algorithm run with 20 Chinese paradigm morphemes resulted not only in high precision but also much higher recall than Chinese paradigm words in the same corpus.</Paragraph>
      <Paragraph position="2">  paradigm morpheme test set Since the morphemes were chosen from a subset of the corpus for evaluation, we repeated the experiment in a separate 1-year corpus (20012002). The results in table 7 reflect a similar pattern in the two corpora - both words and morphemes can get high precision, but morphemes can double the recall of words.</Paragraph>
      <Paragraph position="3">  SO-PMI of 40 paradigm words and 20 paradigm morphemes in 1-year corpus It is assumed that a smaller corpus easily leads to the algorithm's low recall because many low-frequency words in the test set barely associate with the paradigm words. To examine the assumption, the results were further analyzed with the frequency of the test set words. First, the occurrence of the test set words in the 9-year corpus was counted, then the median of the frequency, 44 in this case, was taken. The results were divided into two sections from the median value, and the recall of two sections was calculated respectively, as in table 8.</Paragraph>
      <Paragraph position="5"> of high-frequency and low-frequency words The results showed that high-frequency words could be largely extracted by the algorithm with both morphemes (99.80% recall) and words (89.45% recall). However, paradigm words gave 26.55% recall of low-frequency words, whereas paradigm morphemes gave 67.66%. They showed that morphemes outperform words in the retrieval of low-frequency words.</Paragraph>
      <Paragraph position="6"> Colossal corpora like Turney's hundred-billion-word corpus can compensate for the low performance of paradigm words in low-frequency words. Such a large corpus has been easilyaccessible since the emergence of internet, but it is not cost-effective to use the Chinese texts from the internet because those texts are not segmented.</Paragraph>
      <Paragraph position="7"> Another way of compensation is the expansion of paradigm words, but doubling the number of paradigm words just raised the recall from 45.56% to 59.57%, as shown in section 4. The supervised cost is not reasonable if the number of paradigm words is further expanded.</Paragraph>
      <Paragraph position="8"> Morphemes, or single characters in Chinese, naturally occur more frequently than words in an article, so 20 morphemes can be more discretelydistributed over texts than 20 or even 40 words. The results show that some morphemes always retain their inherent SO when becoming constituents in other derived words. Such morphemes are like a zipped file of the same SO, when the algorithm is run with 20 paradigm morphemes, it is actually run by thousands of paradigm words. Consequently, the recall could double while the high precision was not affected.</Paragraph>
      <Paragraph position="9"> It may be argued that the labour cost of defining the SO of 20 morphemes is not sufficiently low either. The following experiments will demonstrate that decreasing the number of morphemes can also give satisfactory results.</Paragraph>
      <Paragraph position="10"> 6. Experiment with different number of morphemes The following experiments were done respectively by decreasing the number of morphemes, i.e., 14 and 10 morphemes, chosen from table 5. The algorithm was then run with 3 groups of 6 different morphemes, in which the morphemes were different, and the combination of morphemes in each group was random. The morphemes in each group are shown in table 9. Other conditions for the experiments were unchanged.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Results and Evaluation
</SectionTitle>
      <Paragraph position="0"> Table 10 shows the results with different number of morphemes, and table 11 shows those for different groups of 6 morphemes. For convenient comparison, the tables only show the results of the full test set, i.e., no threshold filtering.</Paragraph>
      <Paragraph position="1"> It is shown that the recall falls as the number of morphemes is reduced. However, even the average recall 66.29% under the 6-morpheme condition is still higher than that under the 40-word condition (59.57%). In section 5, it was evaluated that low recall could be attributed to the low frequency of test set words. Therefore, 6 to 10 morphemes are already ideal for deducing the SO of high-frequency words.</Paragraph>
      <Paragraph position="2">  test set words with 3 different groups of 6 morphemes The precision remains high from 20 morphemes to 6 morphemes, but from table 10 the precision varies with different sets of morphemes. Group 3 gave the lowest precision of 68.77%, whereas other groups gave a high precision close to 80%. The limited space of this paper cannot allow a detailed investigation into the reasons for this result, only some suggestions can be made.</Paragraph>
      <Paragraph position="3"> The precision may be related to the dominant lexical types of the words constructed by the morphemes and those of the test set words. Lexical types should be carefully considered in the algorithm for Chinese because Chinese is an isolating language - no form change. For example, the word ' !currency1' (recover) can appear in different positions of a sentence, such as the following examples extracted from the corpus:  (1)...' AE&amp;o '- ! !currency19 currency1&amp;++... (...American economy is gradually recovering...) (2) ... &amp;quot;1ae %0 q &amp;quot;&amp;o !currency10ae/ /-o (...most people is now pessimistic about the economy recovery) (3) ... 4 !currency19 = L J5 ,.</Paragraph>
      <Paragraph position="4">  (...decelerates the recovery, but also makes the future unpredictable.) English allows different forms of 'recovery, like 'recovery', 'recovering', 'recovered' but Chinese does not. Lexical types are thus an important factor for the precision performance. Another way of solving the problems of lexical types is the automatic extraction of meaningful units (Danielsson, 2003). Simply, meaningful units are some frequently-used patterns which consist of two or more words. It is useful to automatically extract the meaningful units with SO in future.</Paragraph>
      <Paragraph position="5"> Syntactic markers like negation, and creative uses like ironical expression of adding quotation marks can also affect the precision. Here is an example from the corpus: (' &amp;quot; AE q ('HONEST BUSINESSMAN'). The quotation mark (' ' in English) is to actually express the opposite meaning of words within the mark, i.e., HONEST means DISHONEST in this case. Such markers should further be handled, just as with the use of 'so-called'.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>