<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1654"> <Title>Random Indexing using Statistical Weight Functions</Title> <Section position="9" start_page="461" end_page="462" type="evalu"> <SectionTitle> 7 Results </SectionTitle> <Paragraph position="0"> Table 3 shows the results for the experiments extracting synonymy. The basic Random Indexing algorithm (FREQ) produces a DIRECT score of 2.87, and an INVR of 0.94. It is interesting that the only other linear weight, IDENTITY, produces more accurate results. This shows high frequency, low information contexts reduce the accuracy of Random Indexing. IDENTITY removes this effect by ignoring frequency, but does not address the information aspect. A more accurate weight will consider the information provided by a context in its weighting.</Paragraph> <Paragraph position="1"> There was a large variance in the effectiveness of the other weights and most proved to be detrimental to Random Indexing. TF-IDF was the worst, reducing the DIRECT score to 0.30 and the INVR to 0.07. TF-IDFy, which is a log-weighted alternative to TF-IDF, produced very good results. With the exception of DICELOG, adding an additional log factor improved performance (TF-IDFy, MILOG and TTESTLOG). Unrestricted ranges improved the MI family, but made no difference to TTEST. Grefenstette's variation on TF-IDF (GREF94) does not perform as well as TF-IDFy, and Lin's variations on MI+- (LIN98A, LIN98B) do not perform as well as MILOG+-.</Paragraph> <Paragraph position="2"> MILOG+- had a higher INVR than TF-IDFy, but a lower DIRECT score, indicating that it forces more correct results to the top of the results list, but also forces some correct results further down so that they no longer appear in the top 100.</Paragraph> <Paragraph position="3"> very large corpus The effect of high frequency contexts is increased further as we increase the size of the corpus. Table 5 presents results using the 2 billion word corpus used by Curran (2004). This consists of the non-speech portion of the BNC, the Reuter's Corpus Volume 1 and most of the English news holdings of the LDC in 2003. Contexts were extracted as presented in Section 4. A frequency cut-off of 100 was applied and the values d = 1000 and epsilon1 = 5 for FREQ and epsilon1 = 10 for the improved weights were used.</Paragraph> <Paragraph position="4"> We see that the very large corpus has reduced the accuracy of frequency weighted Random Indexing. In contrast, our two top performers have both substantially increased in accuracy, presenting a 75-100% improvment in performance over FREQ. MILOG+- is more accurate than TF-IDFy for both measures of accuracy now, indicating it is a better weight function for very large data sets.</Paragraph> <Section position="1" start_page="461" end_page="462" type="sub_section"> <SectionTitle> 7.1 Bilingual Lexicon Acquisition </SectionTitle> <Paragraph position="0"> When the same function were applied to the bilingual lexicon acquisition task we see substantially different results: neither the improvement nor the extremely poor results are found (Table 4).</Paragraph> <Paragraph position="1"> context In the English-German corpora we replicate Sahlgren and Karlgren's (2005) results, with a precision of 58%. This has a DIRECT score of 6.1 and an INVR of 0.97. 
<Section position="1" start_page="461" end_page="462" type="sub_section"> <SectionTitle> 7.1 Bilingual Lexicon Acquisition </SectionTitle>
<Paragraph position="0"> When the same weight functions were applied to the bilingual lexicon acquisition task we see substantially different results: neither the improvements nor the extremely poor results are found (Table 4).</Paragraph>
<Paragraph position="1"> In the English-German corpora we replicate Sahlgren and Karlgren's (2005) results, with a precision of 58%. This has a DIRECT score of 6.1 and an INVR of 0.97. The only weight to make an improvement is TF-IDFy, which has a DIRECT score of 6.3 but a lower INVR; all weights perform worse on at least one measure.</Paragraph>
<Paragraph position="2"> Our results for the Spanish-Swedish corpora are similar. Our accuracy is down from that in Sahlgren and Karlgren (2005), which is explained by our application of the frequency cut-off to both the source and target languages. There are more weights with higher accuracies, and fewer with significantly lower accuracies.</Paragraph>
</Section>
<Section position="2" start_page="462" end_page="462" type="sub_section"> <SectionTitle> 7.2 Smaller Corpora </SectionTitle>
<Paragraph position="0"> The absence of a substantial improvement in bilingual lexicon acquisition requires further investigation. Three main factors differ between our monolingual and bilingual experiments: we are smoothing a homogeneous data set in our monolingual experiments but a heterogeneous data set in our bilingual experiments; we use local grammatical contexts in our monolingual experiments but paragraph contexts in our bilingual experiments; and the volume of raw data used in our monolingual experiments is many times that used in our bilingual experiments.</Paragraph>
<Paragraph position="1"> Figure 1 presents results for corpora extracted from the BNC using the window-based context.</Paragraph>
<Paragraph position="2"> Results are shown for the original Random Indexing (FREQ) and for IDENTITY, MILOG± and TF-IDFy, as well as for the full vector space measurement using the JACCARD measure and the TTEST± weight (Curran, 2004). Of the Random Indexing results, FREQ produces the lowest overall accuracy. It performs better than MILOG± for very small corpora, but produces near-constant results as the corpus grows. Curran and Moens (2002) found that increasing the volume of input data increased the accuracy of results generated using a full vector space model. Without weighting, Random Indexing does not show this behaviour, but once weighting is applied Curran and Moens' results are confirmed.</Paragraph>
<Paragraph position="3"> The quality of the contexts extracted influences how individual weights perform, but Random Indexing with weighting still outperforms Random Indexing without it. The relative performance of MILOG± is reduced when compared with TF-IDFy, but is still greater than FREQ.</Paragraph>
<Paragraph position="4"> Gorman and Curran (2006) showed Random Indexing to be much faster than full vector space techniques, but with a 46-56% reduction in accuracy compared to using JACCARD and TTEST±.</Paragraph>
<Paragraph position="5"> Using the MI± weight kept the improvement in speed, but with only a 10-18% reduction in accuracy. When JACCARD and TTEST± are used with our low-quality contexts they perform consistently worse than Random Indexing. This indicates that Random Indexing is stable in the presence of noisy data. It would be interesting to further compare these results to those produced by LSA.</Paragraph>
<Paragraph position="6"> The results we have presented show that applying weights to Random Indexing can improve its performance on thesaurus extraction tasks. This improvement is dependent on the volume of raw data used to generate the context information. It is less dependent on the quality of the contexts extracted.</Paragraph>
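For reference, the DIRECT and INVR figures quoted throughout this section can be computed along the following lines. This is a sketch assuming Curran (2004)-style measures: DIRECT counts gold-standard synonyms among the top n extracted terms (n = 100, matching the top-100 cut-off mentioned above) and INVR sums the reciprocal ranks of the gold-standard synonyms found. The function name and the exact cut-off are assumptions for illustration, not taken from the paper.

    def direct_and_invr(ranked_terms, gold_synonyms, n=100):
        """ranked_terms: extracted synonyms for a headword, best first.
        gold_synonyms: set of gold-standard synonyms for that headword.
        Returns (DIRECT, INVR): matches in the top n, and the sum of
        reciprocal ranks of every matching term."""
        direct = sum(1 for term in ranked_terms[:n] if term in gold_synonyms)
        invr = sum(1.0 / rank
                   for rank, term in enumerate(ranked_terms, start=1)
                   if term in gold_synonyms)
        return direct, invr

    # e.g. gold synonyms at ranks 2 and 5 give INVR = 1/2 + 1/5 = 0.7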
<Paragraph position="7"> What we have not shown is whether this extends to the extraction of bilingual lexicons. The bilingual corpora have 12-16 million words per language, and in our monolingual experiments we already see substantial improvement with corpora as small as 5 million words (Figure 1). It may be that extracting paragraph-level contexts is not well suited to weighting, or that the heterogeneous nature of the aligned corpora reduces the meaningfulness of weighting. There is also the question of whether it can be applied to all languages; the lack of freely available large-scale multilingual resources makes this difficult to examine.</Paragraph>
</Section> </Section> </Paper>