<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1069">
  <Title>An Automatic Clustering of Articles Using Dictionary Definitions</Title>
  <Section position="3" start_page="0" end_page="406" type="intro">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> One of major approaches in automatic clustering of articles is based on statistical information of words in ,~rticles. Every article is characterised by a vector, each dimension of which is associated with a specific word in articles, and every coordinate of the artMe is represented by tern: weighting. Tern1 weighting methods have been widely studied in iufornmtion retrieval research (Salton, 1983), (Jones, 1972) and some of then: are used in an automatic clustering of articles. Guthrie and Yuasa used word frequencies for weighting (Guthrie, 1994), (Yuasa, 1995), and Tokunaga used weighted inverse document frequency which is a word frequency within the document divided by its fl'equency throughout the entire document collection (Tokunaga, 1994). The results of these methods when al)plied to articles' cbussification task, seem to show its etfectiveness. However, these works do not seriously deal with the 1)roblem of polysemy.</Paragraph>
    <Paragraph position="1"> The alternative al)l)roach is based on dictionary's infl)rlnation as a thesaurus. One of major problems using thesaurus ('ategories a.s sense represe::tation is a statistical sparseness for thesaurus words, since they are nmstly rather uncommon words (Niwa, 1995). Yuasa reported the experimental results when using word frequencies for weighting within large documents were better resuits in clustering (lo('unmnts as those when EDR electronic dictionary as a thesaurus (Yuasa, 1995).</Paragraph>
    <Paragraph position="2"> The technique developed by Walker also used (lietionary's infornmtion and seems to cope with the discrimination of polysemy (Walker, 1986).</Paragraph>
    <Paragraph position="3"> He used the semantic codes of the Longmau Dictionary of Contemporary English in order to determine the subject donmin for a set of texts. For a given text, each word is checked against the dictionary to determine the semantic codes associate(l with it. By accumulating the frequencies for these senses and then ordering the list, of categories in terms of frequency, the subject matter of  the text can be identified. However, ~us he admits, a phrasal lexicon, such as Atlantic Seaboard, New England gives a negative influence for clustering, since it can not be regarded ~us units, i.e. each word which is the element of a 1)hrasal lexicon is assigned to each semantic code.</Paragraph>
    <Paragraph position="4"> The approach proposed in this paper focuses on these l)roblems, i.e. 1)olysemy and a phrasal lexicon. Like Guthrie and Yuasa's methods, our approach adopts a vector representation, i.e. every article is characterised by a vector. However~ while their ~pproaehes assign each (:oor(linate of a vector to each word in artMes, we use a word (noun) of wtfich sense is disambiguated. Our disambiguation method of word-senses is based on Niwa's method whMt use(l the similarit;y 1)etween two sentences, i.e. a sentevee which contains a polysenmus noun and a sevtenee of dictionarydefinition. In order to cope with Walker's l)rob lem, for the results of disand)iguation technique, semantic relativeness of words are cMeulated, and semantically related words are grout)ed together.</Paragraph>
    <Paragraph position="5"> We used WSJ corpus as test artich,s in the experiments in order to see how our metho(l can effectively classify artMes, eacl, &lt;)f whi&lt;:h beh)ngs te the restricted subject domain, i.e. WS.I.</Paragraph>
  </Section>
class="xml-element"></Paper>