Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words

5 Evaluation

Lexical acquisition algorithms are notoriously hard to evaluate. We have attempted to be as thorough as possible, using several languages and both automatic and human evaluation. In the automatic part, we followed the methodology and data used in previous work as closely as possible, so that meaningful comparisons could be made.

5.1 Languages and Corpora

We performed in-depth evaluation on two languages, English and Russian, using three corpora: two for English and one for Russian. The first English corpus is the BNC, containing about 100M words. The second English corpus, Dmoz (Gabrilovich and Markovitch, 2005), is a web corpus obtained by crawling and cleaning the URLs in the Open Directory Project (dmoz.org), resulting in 68GB of text containing about 8.2G words from 50M web pages.

The Russian corpus was assembled from many web sites and carefully filtered for duplicates, yielding 33GB and 4G words. It is a varied corpus comprising literature, technical texts, news, newsgroups, etc.

As a preliminary sanity check, we also applied our method to smaller corpora in Danish, Irish and Portuguese, and noted substantial similarities in the discovered patterns. For example, in all five languages the pattern corresponding to 'x and y' was among the 50 selected.

5.2 Thresholds, Statistics and Examples

The thresholds TH, TC, TP, ZT, ZB were determined by memory size considerations: we computed the thresholds that would give us the maximal number of words while still allowing the pattern access table to reside in main memory. The resulting values are 100, 50, 20, 100 and 100, respectively.

Corpus window size was determined by starting from a very small window size, defining at random a single window of that size, running the algorithm, and iterating this process with increased window sizes until reaching a desired vocabulary category participation percentage (i.e., x% of the distinct words in the corpus assigned to categories; we used 5%). This process has only a negligible effect on running times, because each iteration is run on a single window only, not on the whole corpus.
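This calibration loop is straightforward to express in code. The sketch below is ours, not the paper's: run_discovery stands in for the full pattern-discovery pipeline, and the initial window size and growth factor are illustrative assumptions.

```python
import random

def calibrate_window_size(tokens, run_discovery, target_coverage=0.05,
                          initial_size=100_000, growth=2):
    """Grow the sampled window until the categorized words cover the
    target fraction (5% in the paper) of the corpus vocabulary.

    `run_discovery` is a stand-in for the discovery algorithm: given a
    list of tokens, it returns the set of words assigned to categories.
    """
    vocab = set(tokens)
    size = initial_size
    while size < len(tokens):
        # Define a single random window of the current size.
        start = random.randrange(len(tokens) - size + 1)
        categorized = run_discovery(tokens[start:start + size])
        if len(categorized & vocab) / len(vocab) >= target_coverage:
            return size
        size *= growth  # iterate with an increased window size
    return len(tokens)  # fall back to the whole corpus
```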
The table below gives some statistics. V is the total number of distinct words in the corpus. W is the number of words belonging to at least one of our categories. C is the number of categories (after merging and windowing). AS is the average category size. Running times are in minutes on a 2.53GHz Pentium 4 XP machine with 1GB of memory. Note how small they are compared to (Pantel et al., 2004), which took 4 days for a smaller corpus using the same CPU.

Among the patterns discovered are the ubiquitous 'x and y' and 'x or y', and many patterns containing them. Additional patterns include 'from x to y', 'x and/or y' (punctuation is treated here as white space), 'x and a y', and 'neither x nor y'.

We discover categories of different parts of speech. Among the noun categories there are many whose precision is 100%: 37 countries, 18 languages, 51 chemical elements, 62 animals, 28 types of meat, 19 fruits, 32 university names, etc.

A nice verb category example is {dive, snorkel, swim, float, surf, sail, canoe, kayak, paddle, tube, drift}. A nice adjective example is {amazing, awesome, fascinating, inspiring, inspirational, exciting, fantastic, breathtaking, gorgeous}.

5.3 Human Judgment Evaluation

The purpose of the human evaluation was twofold: to assess the quality of the discovered categories in terms of precision, and to compare them with the categories obtained by a baseline clustering algorithm.

For the baseline, we implemented k-means as follows. We removed stopwords from the corpus and then used as features the words that appear before or after the target word. In the calculation of feature values and inter-vector distances, and in the removal of less informative features, we strictly followed (Pantel and Lin, 2002). We ran the algorithm 10 times with k = 500 and randomly selected centroids, producing 5000 clusters. We then merged the resulting clusters using the same 50% overlap criterion as in our algorithm. This yielded 3090, 2116 and 3206 clusters for Dmoz, BNC and Russian, respectively.
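For concreteness, here is one way the baseline could look in Python. This is our reading, not the paper's code: the feature weighting and distance computations that follow Pantel and Lin (2002) are omitted, the feature matrix is assumed to be built from the adjacency counts, and the greedy interpretation of the 50% overlap merge is an assumption.

```python
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans

def adjacency_features(tokens, stopwords, targets):
    """Count words appearing directly before/after each target word,
    after stopword removal. Feature weighting per Pantel and Lin (2002)
    would be applied on top of these raw counts."""
    toks = [t for t in tokens if t not in stopwords]
    feats = defaultdict(Counter)
    for i, w in enumerate(toks):
        if w in targets:
            if i > 0:
                feats[w]["L:" + toks[i - 1]] += 1
            if i + 1 < len(toks):
                feats[w]["R:" + toks[i + 1]] += 1
    return feats

def kmeans_baseline(matrix, words, k=500, runs=10, overlap=0.5, seed=0):
    """Run k-means `runs` times with random centroids, then merge
    clusters greedily when they overlap by at least `overlap` of the
    smaller cluster (one plausible reading of the 50% criterion)."""
    clusters = []
    for r in range(runs):
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=seed + r)
        labels = km.fit_predict(matrix)
        for c in range(k):
            members = {words[i] for i in np.flatnonzero(labels == c)}
            if members:
                clusters.append(members)
    merged = []
    for c in clusters:
        for m in merged:
            if len(c & m) >= overlap * min(len(c), len(m)):
                m |= c
                break
        else:
            merged.append(c)
    return merged
```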
We used 8 subjects for the evaluation of the English categories and 15 subjects for the Russian ones. In order to assess the subjects' reliability, we also included random categories (see below).

The experiment contained two parts. In Part I, subjects were given 40 triplets of words and were asked to rate them on the following scale: (1) the words definitely share a significant part of their meaning; (2) the words have a shared meaning, but only in some context; (3) the words have a shared meaning only under a very unusual context/situation; (4) the words do not share any meaning; (5) I am not familiar enough with some/all of the words.

The 40 triplets were obtained as follows. 20 of our categories were selected at random from the non-overlapping categories we had discovered, and three words were selected at random from each of these. 10 triplets were selected in the same manner from the categories produced by k-means, and 10 triplets were generated by random selection of content words from the same window in the corpus.

In Part II, subjects were given the full categories of the triplets that were graded 1 or 2 in Part I (that is, the full 'good' categories in terms of shared meaning). They were asked to grade the categories from 1 (worst) to 10 (best) according to how well the full category met the expectations they had formed when seeing only the triplet.

Results are given in Table 1 (which shows our method, k-means and random categories on the three corpora; see text for detailed explanations). The first line gives the average percentage of triplets that were given scores of 1 or 2 (that is, 'significant shared meaning'). The second line gives the average score of a triplet (1 is best). Scores of 5 were not counted in these two lines. The third line gives the average score given to a full category (10 is best). Inter-evaluator kappa, with scores 1,2 grouped against 3,4, was 0.56, 0.67 and 0.72 for Dmoz, BNC and Russian, respectively.

Our algorithm clearly outperforms k-means, which in turn outperforms random selection. We believe that the Russian results are better because the percentage of native speakers among our subjects was larger for Russian than for English.

5.4 WordNet-Based Evaluation

The major guideline in this part of the evaluation was to compare our results with previous work having a similar goal (Widdows and Dorow, 2002). We followed their methodology as closely as we could, using the same WordNet (WN) categories and the same corpus (BNC), in addition to the Dmoz and Russian corpora. (Widdows and Dorow (2002) also report results for an LSA-based clustering algorithm; these are vastly inferior to the pattern-based ones.)

The evaluation method is as follows. We took the exact 10 WN subsets referred to as 'subjects' in (Widdows and Dorow, 2002) and removed all multi-word items. We then selected at random 10 pairs of words from each subject. For each pair, we found the largest of our discovered categories containing it (if there was none, we picked another pair; this is valid because our recall is obviously nowhere near 100%, so refusing to pick another pair would seriously harm the validity of the evaluation). The various morphological forms of the same word were treated as one during the evaluation.

The only difference from the (Widdows and Dorow, 2002) experiment is the use of pairs rather than single words. We did this in order to disambiguate our categories. This was not needed in (Widdows and Dorow, 2002) because they accessed the word graph directly, which may be an advantage in some applications.

The Russian evaluation posed a bit of a problem because the Russian WordNet is not readily available and its coverage is rather small. Fortunately, the subject list is such that WordNet words could be translated unambiguously to Russian, and words in our discovered categories could be translated unambiguously into English. This was the methodology taken.

For each found category C containing N words, we computed the following (see Table 2): (1) Precision: the number of words present in both C and WN, divided by N. (2) Precision*: the number of correct words divided by N, where correct words are either words that appear in the WN subtree or words whose entry in the American Heritage Dictionary or the Britannica directly defines them as belonging to the given class (e.g., 'keyboard' is defined as 'a piano'; 'mitt' is defined as 'a type of glove'); this was done in order to overcome the relative poverty of WordNet. (3) Recall: the number of words present in both C and WN, divided by the number of (single) words in WN. (4) New: the number of correctly discovered words that are not in WN. The table also shows the number of WN words (:WN), to give a feeling for how much WN could be improved here. For each subject, we show the average over the 10 randomly selected pairs.
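These four per-category measures reduce to simple set arithmetic. A minimal sketch (the function and parameter names are ours; extra_correct models the manual dictionary check behind Precision*):

```python
def category_measures(category, wn_words, extra_correct=frozenset()):
    """Score one discovered category C against a WordNet subject.

    category:      the words of C (a set of size N)
    wn_words:      the single-word members of the WN subtree
    extra_correct: words verified as correct via dictionary definitions
    """
    n = len(category)
    in_wn = category & wn_words                # words in both C and WN
    correct = in_wn | (category & extra_correct)
    return {
        "precision": len(in_wn) / n,           # |C intersect WN| / N
        "precision*": len(correct) / n,        # correct words / N
        "recall": len(in_wn) / len(wn_words),  # |C intersect WN| / |WN|
        "new": len(correct - wn_words),        # correct words not in WN
    }
```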
Table 2 also shows the average of each measure over the subjects, and the two precision measures computed on the total set of WN words. The (uncorrected) precision is the only metric given in (Widdows and Dorow, 2002), who reported 82% (for the BNC); our method gives 90.47% for this metric on the same corpus.

5.5 Summary

Our human-evaluated and WordNet-based results are better than the baseline and previous work, respectively. Both are also of good standalone quality. Clearly, evaluation methodology for lexical acquisition tasks should be improved, which is an interesting research direction in itself.

Examining our categories at random, we found a nice example that shows both how difficult this task is to evaluate and how useful automatic category discovery can be, as opposed to manual definition. Consider the following category, discovered in the Dmoz corpus: {nightcrawlers, chicken, shrimp, liver, leeches}. We did not know why these words were grouped together; if asked in an evaluation, we would have given the category a very low score. However, after some web searching, we found that this is a 'fish bait' category, especially suitable for catfish.