<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0604"> <Title>New Experiments in Distributional Representations of Synonymy</Title> <Section position="3" start_page="0" end_page="25" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>

<Paragraph position="0"> Many text applications are predicated on the idea that shallow lexical semantics can be acquired through corpus analysis. Harris articulated the expectation that words with similar meanings would be used in similar contexts (Harris, 1968), and recent empirical work involving large corpora has borne this out. In particular, by associating each word with a distribution over the words observed in its context, we can distinguish synonyms from non-synonyms with fair reliability. This capability may be exploited to generate corpus-based thesauri automatically (Lin, 1998), or used in any other text application that might benefit from a measure of lexical semantic similarity. And synonymy is a logical first step in a broader research program that seeks to account for natural language semantics through distributional means.</Paragraph>

<Paragraph position="1"> Previous research into corpus-analytic approaches to synonymy has used the Test of English as a Foreign Language (TOEFL). The TOEFL consists of 300 multiple-choice questions, each involving five words: the problem or target word and four response words, one of which is a synonym of the target. The objective is to identify the synonym (call this the answer word, and call the other response words decoys). In the context of research into lexical semantics, we seek a distance function that ranks the answer word ahead of the decoys as reliably as possible.</Paragraph>

<Paragraph position="2"> Landauer and Dumais first proposed the TOEFL as a test of lexical semantic similarity and reported a score of 64.4% on an 80-question version of the test, a score nearly identical to the average score of human test takers (Landauer and Dumais, 1997).</Paragraph>

<Paragraph position="3"> Subsequently, Sahlgren reported a score of 72.0% on the same test using "random indexing" and a different training corpus (Sahlgren, 2001). By analyzing a much larger corpus, Ehlert was able to score 82% on a 300-question version of the TOEFL, using a simple distribution over contextual words (Ehlert, 2003).</Paragraph>

<Paragraph position="4"> While success on the TOEFL does not immediately guarantee success in real-world applications requiring lexical similarity judgments, the scores have an intuitive appeal. They are easily interpretable, and the expected performance of a random guesser (25%) and typical human performance are both known. Nevertheless, the TOEFL is problematic in at least two ways. First, because it involves so few questions, conclusions based on the TOEFL regarding closely competing approaches are suspect. Even on the 300-question TOEFL, a score of 82% is accurate only to within plus or minus 4% at the 95% confidence level. Second, there is a potential mismatch between the test vocabulary and the corpus vocabulary. Typically, a substantial number of questions include words observed too infrequently in the training corpus for a semantic judgment to be made with any confidence.</Paragraph>
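[Editor's note: as a sanity check on the plus-or-minus 4% figure above, the standard normal-approximation 95% confidence interval for a binomial proportion reproduces it. This short calculation is an editorial illustration, not part of the original paper.]

import math

# Ehlert's reported accuracy and the number of TOEFL questions.
p, n = 0.82, 300

# Half-width of the 95% normal-approximation (Wald) interval:
# 1.96 * sqrt(p * (1 - p) / n).
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"95% CI: {p:.2f} +/- {half_width:.3f}")  # -> 0.82 +/- 0.043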
<Paragraph position="5"> We seek to overcome these difficulties by generating TOEFL-like tests automatically from WordNet (Fellbaum, 1998). While WordNet has been used before to evaluate corpus-analytic approaches to lexical similarity (Lin, 1998), the metric proposed in that study, though useful for comparative purposes, lacks an intuitive interpretation. In contrast, we emulate the TOEFL using WordNet and inherit the TOEFL's easy interpretability.</Paragraph>

<Paragraph position="6"> Given a corpus, we first derive a list of words occurring with sufficient marginal frequency to support a distributional comparison. We then use WordNet to generate a large set of questions identical in format to those in the TOEFL; a sketch of this procedure appears below. For a vocabulary of reasonable size, this yields thousands of questions. While the resulting questions differ in some interesting ways from those in the TOEFL (see below), their sheer number supports more confident conclusions. Beyond this, we can partition them by part of speech or degree of polysemy, enabling some analyses not supported by the original TOEFL.</Paragraph>
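[Editor's note: the paper does not give an implementation of this generation procedure; the following minimal sketch shows one plausible reading of it, using NLTK's WordNet reader. The toy vocabulary, the helper names (synonyms, make_question), and the choice of sampling three decoys at random are illustrative assumptions, not the authors' code.]

import random
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def synonyms(word, pos):
    """Words sharing at least one WordNet synset with `word`."""
    return {lemma for s in wn.synsets(word, pos=pos)
            for lemma in s.lemma_names() if lemma != word}

def make_question(target, vocab, pos=wn.NOUN, rng=random):
    """One TOEFL-style item (target, answer, decoys), or None if the
    target has no in-vocabulary synonym."""
    syns = synonyms(target, pos) & vocab
    if not syns:
        return None
    answer = rng.choice(sorted(syns))
    # Decoys: in-vocabulary words that share no synset with the target.
    decoys = rng.sample(sorted(vocab - syns - {target}), 3)
    return target, answer, decoys

# Toy stand-in for the frequency-filtered corpus vocabulary.
vocab = {"car", "automobile", "journey", "trip", "house", "tree", "river"}
questions = [q for w in sorted(vocab) if (q := make_question(w, vocab))]

</Section> </Paper>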