<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1053"> <Title>Augmenting Lexicons Automatically: Clustering Semantically Related Adjectives</Title> <Section position="4" start_page="273" end_page="275" type="metho"> <SectionTitle> 3. RESULTS </SectionTitle> <Paragraph position="0"> We tested our system on an 8.2 million word corpus of stock market reports from the AP news wire.**** A subset of 21 of the adjectives in the corpus (Figure 1) was selected for practical reasons (mainly to keep the evaluation task tractable). We selected adjectives that have one modified noun in common (problem) to ensure some semantic relatedness, and we included only adjectives that occurred frequently, so that our similarity measure would be meaningful.

****We thank Karen Kukich and Frank Smadja for providing us access to the corpus.</Paragraph> <Paragraph position="1"> The partition produced by the system for 9 clusters appears in Figure 2. Since the number of clusters is not determined by the system, we present the partition whose number of clusters is close to the number the humans used for the same set of adjectives (the average number of clusters in the human-made models was 8.56).

Figure 2: The partition produced by the system for 9 clusters.
1. foreign global international
2. old
3. potential
4. new real unexpected
5. little staggering
6. economic financial mechanical political technical
7. antitrust
8. big major serious severe
9. legal</Paragraph> <Paragraph position="2"> Before presenting a formal evaluation of the results, we note that this partition contains interesting data. First, the results contain two clusters of gradable adjectives which fall in the same scale. Groups 5 and 8 contain adjectives that indicate the size, or scope, of a problem; by augmenting the system with tests to identify when an adjective is gradable, we could separate out these two groups from other potential scales, and perhaps consider combining them. Second, groups 1 and 6 clearly identify separate sets of non-gradable, non-scalar adjectives; the former group contains adjectives that describe the geographical scope of the problem, while the latter contains adjectives that specify the nature of the problem. It is interesting to note here that the expected number of adjectives per cluster is 21/9 ≈ 2.33, and the clustering algorithm employed discourages long groups; nevertheless, the evidence for the adjectives in group 6 is strong enough to allow the creation of a group with more than twice the expected number of members. Finally, note that even in group 4, which is the weakest group produced, there is a positive semantic correlation between the adjectives new and unexpected. To summarize, the system seems able to identify many of the existing semantic relationships among the adjectives, while its mistakes are limited to creating singleton groups containing adjectives that are related to other adjectives in the test set (e.g., missing the semantic associations between new-old and potential-real) and &quot;recognizing&quot; a non-significant relationship between real and new-unexpected in group 4.</Paragraph> <Paragraph position="3"> We produced good results with relatively little data; the accuracy of the results can be improved if a larger, homogeneous corpus is used to provide the raw data. Furthermore, some of the associations between adjectives that the system reports appear to be more stable than others, e.g. when we vary the number of clusters in the partition. We have noticed that adjectives with a higher degree of semantic content (e.g. international or severe) appear to form more stable associations than relatively semantically empty adjectives (e.g. little or real). This observation can be used to filter out the adjectives which are too general to be meaningfully clustered into groups.</Paragraph> </Section> <Section position="5" start_page="273" end_page="275" type="metho"> <SectionTitle> 4. EVALUATION </SectionTitle> <Paragraph position="0"> To evaluate the performance of our system we compared its output to a model solution for the problem designed by humans. Nine human judges were presented with the set of adjectives to be partitioned, a description of the domain, and a simple example. They were told that clusters should not overlap but that they could select any number of clusters. For our scoring mechanism, we converted the comparison of two partitions to a series of yes-no questions, each of which has a correct answer (as dictated by the model) and an answer assigned by the system. For each pair of adjectives, we asked if they fell in the same cluster (&quot;yes&quot;) or not (&quot;no&quot;).</Paragraph>
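To make the pairwise conversion concrete, here is a minimal Python sketch (ours, not the paper's code; `pair_labels` and the list-of-lists partition representation are our own conventions):

```python
from itertools import combinations

def pair_labels(partition):
    """Map every unordered pair of adjectives to a yes-no answer:
    True iff the two adjectives fall in the same cluster."""
    cluster_of = {adj: i for i, cluster in enumerate(partition)
                  for adj in cluster}
    return {(a, b): cluster_of[a] == cluster_of[b]
            for a, b in combinations(sorted(cluster_of), 2)}

# The system's partition from Figure 2:
system = [
    ["foreign", "global", "international"], ["old"], ["potential"],
    ["new", "real", "unexpected"], ["little", "staggering"],
    ["economic", "financial", "mechanical", "political", "technical"],
    ["antitrust"], ["big", "major", "serious", "severe"], ["legal"],
]
answers = pair_labels(system)   # 21 adjectives yield C(21,2) = 210 questions
assert answers[("global", "international")] is True
assert answers[("legal", "old")] is False
assert len(answers) == 210
```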
<Paragraph position="1"> Since human judges did not always agree, we used fractional values for the correctness of each answer instead of 0 (&quot;incorrect&quot;) and 1 (&quot;correct&quot;). We used multiple human models for the same set of adjectives and defined the correctness of each answer as the relative frequency of the association between the two adjectives among the human models. We then sum these correctness values; in the case of perfect agreement between the models, or of only one model, the measures reduce to their original definitions.</Paragraph> <Paragraph position="2"> Then, the contingency table model [12], widely used in Information Retrieval, is applicable.

Table 1: Contingency table model for evaluation.

                              Answer should be &quot;yes&quot;   Answer should be &quot;no&quot;
The system says &quot;yes&quot;                  a                          b
The system says &quot;no&quot;                   c                          d

Referring to the classification of the yes-no answers in Table 1, the following measures are defined:

$$\text{recall} = \frac{a}{a+c} \times 100\% \qquad \text{precision} = \frac{a}{a+b} \times 100\% \qquad \text{fallout} = \frac{b}{b+d} \times 100\%$$

In other words, recall is the percentage of correct &quot;yes&quot; answers that the system found among the model &quot;yes&quot; answers, precision is the percentage of correct &quot;yes&quot; answers among the total of &quot;yes&quot; answers that the system reported, and fallout is the percentage of incorrect &quot;yes&quot; answers relative to the total number of &quot;no&quot; answers.***** We also compute a combined measure for recall and precision, the F-measure [13], which always takes a value between the values of recall and precision, and is higher when recall and precision are closer; it is defined as

$$F = \frac{(\beta^2 + 1) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$

where $\beta$ is the weight of recall relative to precision; we use $\beta = 1.0$, which corresponds to equal weighting of the two measures.

*****Another measure used in information retrieval, overgeneration, is in our case always equal to (100 - precision)%.</Paragraph>
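A short continuation of the sketch above shows how the fractional correctness values fill the Table 1 cells and yield the four measures (again our own illustrative code; `fractional_answers` and `evaluate` are hypothetical names, and we assume all human models partition the same adjective set):

```python
def fractional_answers(models):
    """Correctness of a "yes" answer for each pair: the relative frequency
    of the association among the human models (0/1 for a single model)."""
    labelings = [pair_labels(m) for m in models]
    return {pair: sum(lab[pair] for lab in labelings) / len(labelings)
            for pair in labelings[0]}

def evaluate(system_partition, models, beta=1.0):
    """Sum fractional correctness values into the Table 1 cells a, b, c, d,
    then derive recall, precision, fallout, and the F-measure (all in %)."""
    system = pair_labels(system_partition)
    model_yes = fractional_answers(models)    # fraction of models saying "yes"
    a = b = c = d = 0.0
    for pair, says_yes in system.items():
        if says_yes:
            a += model_yes[pair]              # correct "yes" answers
            b += 1.0 - model_yes[pair]        # incorrect "yes" answers
        else:
            c += model_yes[pair]              # "yes" answers the system missed
            d += 1.0 - model_yes[pair]        # correct "no" answers
    recall = 100.0 * a / (a + c)
    precision = 100.0 * a / (a + b)
    fallout = 100.0 * b / (b + d)
    f = ((beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
         if precision + recall > 0 else 0.0)
    return recall, precision, fallout, f
```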
<Paragraph position="3"> The results of applying our evaluation method to the system output (Figure 2) are shown in Table 2, which also includes the scores obtained for several other sub-optimal choices of the number of clusters. We have made the following observations related to the evaluation mechanism:

1. Recall is inversely related to fallout and precision. Decreasing the number of clusters generally increases recall and fallout and simultaneously decreases precision.

2. We have found fallout to be a better measure overall than precision, since, in addition to its decision-theoretic advantages [12], it appears to be more consistent across evaluations of partitions with different numbers of clusters. This has also been reported by other researchers working on different evaluation problems [14].

3. For comparison, we evaluated each human model against all the other models, using the above evaluation method; the results ranged from 38 to 72% for recall, 1 to 12% for fallout, 38 to 81% for precision, and, covering a remarkably short range, 49 to 59% for the F-measure, indicating that the performance of the system is not far behind human performance.</Paragraph> <Paragraph position="4"> Finally, before interpreting the scores produced by our evaluation module, we need to understand how they vary as the partition gets better or worse, and what the limits of their values are. Because of the multiple models used, perfect scores are not attainable. Also, because each pair of adjectives in a cluster is counted as an observed association, the relationship between the number of associations produced by a cluster and the number of adjectives in it is not linear (a cluster with k adjectives produces $\binom{k}{2} = O(k^2)$ associations). This leads to lower values of recall, since moving a single adjective out of a cluster with k elements in the model will cause the system to miss k-1 associations. In general, defining a scoring mechanism that compares one partition to another is a hard problem.</Paragraph> <Paragraph position="5"> To quantify these observations, we performed a Monte Carlo analysis [15] of the evaluation metrics, repeatedly creating random partitions of the sample adjectives and evaluating the results. We then estimated a (smoothed) probability density function for each metric from the resulting histograms; part of the results obtained are shown in Figure 3 for the F-measure and fallout using 9 clusters. We observed that the system's performance (indicated by a square in the diagrams) was significantly better than what we would expect under the null hypothesis of random performance; the probability of obtaining a better partition than the system's is extremely small for all metrics (no occurrence in 20,000 trials) except for fallout, for which a random partition may score better 4.9% of the time. The estimated density functions also show that the metrics are severely constrained by the structure imposed by the clustering, as they tend to peak at some point and then fall rapidly.

Figure 3: Estimated probability density functions for the F-measure and fallout with 9 clusters.</Paragraph>
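A minimal sketch of such a Monte Carlo baseline, building on `evaluate` above (our code; we assume uniform random assignment of adjectives to clusters, since the paper does not spell out its randomization scheme):

```python
import random

def random_partition(adjectives, k):
    """Assign each adjective to one of k clusters uniformly at random,
    dropping any clusters that end up empty."""
    clusters = [[] for _ in range(k)]
    for adj in adjectives:
        clusters[random.randrange(k)].append(adj)
    return [c for c in clusters if c]

def monte_carlo(adjectives, models, system_scores, k=9, trials=20000):
    """Estimate, for each metric, the probability that a random partition
    scores at least as well as the system."""
    at_least_as_good = [0, 0, 0, 0]   # recall, precision, fallout, F-measure
    for _ in range(trials):
        scores = evaluate(random_partition(adjectives, k), models)
        for i, (rand_s, sys_s) in enumerate(zip(scores, system_scores)):
            # fallout (index 2) is better when lower; the others when higher
            if (rand_s <= sys_s) if i == 2 else (rand_s >= sys_s):
                at_least_as_good[i] += 1
    return [n / trials for n in at_least_as_good]
```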
</Section> <Section position="6" start_page="275" end_page="276" type="metho"> <SectionTitle> 5. CONCLUSIONS AND FUTURE WORK </SectionTitle> <Paragraph position="0"> We have described a system for extracting groups of semantically related adjectives from large text corpora. Our evaluation reveals that it reaches high performance levels, comparable to those of the human models. Its results can be filtered to produce scalar adjectives that are applicable in any given domain.</Paragraph> <Paragraph position="1"> Eventually, we plan to use the system output to augment adjective entries in a lexicon and to test the augmented lexicon in an application such as language generation.</Paragraph> <Paragraph position="2"> In addition, we have identified several directions for improving the quality of our output:

* Investigating non-linear methods for converting similarities to dissimilarities.
* Experimenting with different evaluation models, preferably ones based on the goodness of each cluster rather than of each association.
* Developing methods for automatically selecting the desired number of clusters for the produced partition. Although this is a particularly hard problem, a steepest-descent method based on the tangent of the objective function may offer a solution.
* Investigating additional sources of linguistic knowledge, such as the use of conjunctions and adverb-adjective pairs.
* Augmenting the system with tests particular to scalar adjectives; for example, exploiting gradability, checking whether two adjectives are antonymous (essentially developing tests in the opposite direction of the work by Justeson and Katz [16]), or comparing the relative semantic strength of two adjectives.</Paragraph> </Section> </Paper>