<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2005">
  <Title>Automatic Acquisition of English Topic Signatures Based on a Second Language</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Lexical knowledge is crucial for many NLP tasks.</Paragraph>
    <Paragraph position="1"> Huge efforts and investments have been made to build repositories with different types of knowledge. Many of them have proved useful, such as WordNet (Miller et al., 1990). However, in some areas, such as WSD, manually created knowledge bases never seem to satisfy the huge requirements of supervised machine learning systems. This is the so-called knowledge acquisition bottleneck.</Paragraph>
    <Paragraph position="2"> As an alternative, automatic or semi-automatic acquisition methods have been proposed to tackle the bottleneck. For example, Agirre et al. (2001) tried to automatically extract topic signatures by querying a search engine using monosemous synonyms or other knowledge associated with a concept defined in WordNet.</Paragraph>
    <Paragraph position="3"> The Web provides further ways of overcoming the bottleneck. Mihalcea et al. (1999) presented a method enabling automatic acquisition of sense-tagged corpora, based on WordNet and an Internet search engine. Chklovski and Mihalcea (2002) presented another interesting proposal which turns to Web users to produce sense-tagged corpora.</Paragraph>
    <Paragraph position="4"> Another type of method, which exploits differences between languages, has shown great promise. For example, some work has been done based on the assumption that mappings of words and meanings are different in different languages.</Paragraph>
    <Paragraph position="5"> Gale et al. (1992) proposed a method which automatically produces sense-tagged data using parallel bilingual corpora. Diab and Resnik (2002) presented an unsupervised method for WSD using the same type of resource. One problem with relying on bilingual corpora for data collection is that bilingual corpora are rare, and aligned bilingual corpora are even rarer. Mining the Web for bilingual text (Resnik, 1999) is not likely to provide sufficient quantities of high-quality data. Another problem is that if two languages are closely related, data for some words cannot be collected because different senses of polysemous words in one language often translate to the same word in the other.</Paragraph>
    <Paragraph position="6"> In this paper, we present a novel approach for automatically acquiring topic signatures (see Table 1 for an example of topic signatures), which also adopts the cross-lingual paradigm. To address the problem, mentioned in the previous paragraph, of different senses not being distinguishable, we chose Chinese, a language very distant from English, since the more distant two languages are, the more likely it is that senses are lexicalised differently (Resnik and Yarowsky, 1999). Because our approach only uses Chinese monolingual text, we also avoid the problem of the shortage of aligned bilingual corpora. We build the topic signatures by using Chinese-English and English-Chinese bilingual lexicons and a large amount of Chinese text, which can be collected either from the Web or from Chinese corpora. Since topic signatures are potentially good training data for WSD algorithms, we set up a task to disambiguate 6 words using a WSD algorithm similar to Schütze's (1998) context-group discrimination. The results show that our topic signatures are useful for WSD.</Paragraph>
    <Paragraph position="7"> The remainder of the paper is organised as follows. Section 2 describes the process of acquiring the topic signatures. Section 3 demonstrates the application of this resource to WSD and presents the results of our experiments. Section 4 discusses factors that could affect the acquisition process, and Section 5 concludes.</Paragraph>
  </Section>
</Paper>