
<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-3006">
  <Title>Automatic Association of Web Directories with Word Senses</Title>
  <Section position="2" start_page="0" end_page="486" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Combining the size and diversity of the textual material on the World Wide Web with the power and efficiency of current search engines is an attractive possibility for acquiring lexical information and corpora. A widespread example is spell-checking: Many Web users routinely use search engines to assess which is the &amp;quot;correct&amp;quot; (i.e. with more hits in the Web) spelling of words. Among NLP researchers, Web search engines have already been used as a point of departure for extraction of parallel corpora, automatic acquisition of sense-tagged corpora, and extraction of lexical information.</Paragraph>
    <Paragraph position="1"> Extraction of parallel corpora. In Resnik (1999), Nie, Simard, and Foster (2001), Ma and Liberman (1999), and Resnik and Smith (2002), the Web is harvested in search of pages that are available in two languages, with the aim of building parallel corpora for any pair of target languages. This is a very promising technique, as many machine translation (MT) and cross-language information retrieval (CLIR) strategies rely on the existence of parallel corpora, which are still a scarce resource. Such Web-mined parallel corpora have proved to be useful, for instance, in the context of the CLEF (Cross-Language Evaluation Forum) CLIR competition, in which many participants use such parallel corpora (provided by the University of Montreal) to improve the performance of their systems (Peters et al. 2002).</Paragraph>
    <Paragraph position="2"> Automatic acquisition of sense-tagged corpora. The description of a word sense can be used to build rich queries in such a way that the occurrences of the word in the documents retrieved are, with some probability, associated with the desired sense.</Paragraph>
    <Paragraph position="3"> If the probability is high enough, it is then possible to acquire sense-tagged corpora [?] ETS Ingenier'ia Inform'atica de la UNED, c/ Juan del Rosal, 16, Ciudad Universitaria, 28040 Madrid, Spain. E-mail: {celina,julio,felisa}@lsi.uned.es  Computational Linguistics Volume 29, Number 3 in a fully automatic fashion. Again, this is an exciting possibility that would solve the current bottleneck of supervised word sense disambiguation (WSD) methods (namely, that sense-tagged corpora are very costly to acquire).</Paragraph>
    <Paragraph position="4"> One example of this kind of technique is Mihalcea and Moldovan (1999), in which a precision of 91% is reported over a set of 20 words with 120 senses. In spite of the high accuracy obtained, such methodology did not perform well in the comparative evaluation reported in Agirre and Mart'inez (2000), perhaps indicating that examples obtained from the Web may have topical biases (depending on the word), and that further refinement is required. For instance, a technique that behaves well with a small set of words might fail in the common cases in which a new sense is predominant on the Web (e.g., oasis or nirvana as music groups, tiger as a golfer, jaguar as a car brand).</Paragraph>
    <Paragraph position="5"> Extraction of lexical information. In Agirre et al. (2000), search engines and the Web are used to assign Web documents to WordNet concepts. The resulting sets of documents are then processed to build topic signatures, that is, sets of words with weights that enrich the description of a concept. In Grefenstette (1999), the number of hits in Web search engines is used as a source of evidence to select optimal translations for multiword expressions. For instance, apple juice is selected as a better translation than apple sap for the German ApfelSaft because apple juice hits a thousand times more documents in AltaVista. Finally, in Joho and Sanderson (2000) and Fujii and Ishikawa (1999), the Web is used as a resource to provide descriptive phrases or definitions for technical terms.</Paragraph>
    <Paragraph position="6"> A common problem to all the above applications is how to detect and filter out all the noisy material on the Web, and how to characterize the rest (Kilgarriff 2001b). Our starting hypotheses is that Web directories (e.g., Yahoo, AltaVista or Google directories, the Open Directory Project [ODP]), in which documents are mostly manually classified in hierarchical topical clusters, are an optimal source for acquiring lexical information; their size is not comparable to the full Web, but they are still enormous sources of semistructured, semifiltered information waiting to be mined.</Paragraph>
    <Paragraph position="7"> In this article, we describe an algorithm for assigning Web directories (from the Open Directory Project &lt;http://dmoz.org&gt; ) as characterizations for word senses in WordNet 1.7 noun synsets (Miller 1990). For instance, let us consider the noun circuit, which has six senses in WordNet 1.7. These senses are grouped in synsets, together with their synonym terms, and linked to broader (more general) synsets via hypernymy relations:</Paragraph>
  </Section>
class="xml-element"></Paper>