<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1047">
  <Title>A Method of Measuring Term Representativeness - Baseline Method Using Co-occurrence Distribution -</Title>
  <Section position="4" start_page="325" end_page="325" type="concl">
    <SectionTitle>
5. Conclusion and future works
</SectionTitle>
    <Paragraph position="0"> We have developed a better method -- the baseline method -- for defining the representativeness of a term. A characteristic value of all documents containing a term T, D(T), is normalized by using a baseline function that estimates the characteristic value of a randomly chosen document set of the same size as D(T). The normalized value is used to measure the representativeness of the term T, and a measure defined by the baseline method offers several advantages over classical measures: (1) its definition is mathematically simple and clean, (2) it can compare high-frequency terms with low-frequency terms, (3) the threshold value for being representative can be defined systematically, and (4) it can be applied to n-gram terms for any n. We developed two measures: one based on the normalized distance between two word distributions (Rep(., LLR)) and another based on the number of different words in a document set (Rep(., DIFFNUM)). We compared these measures with two classical measures from various viewpoints, and confirmed that Rep(., LLR) was superior.</Paragraph>
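The normalization described above can be illustrated with a minimal Python sketch for the DIFFNUM case. It is not the paper's implementation and rests on several assumptions: the size of a document set is counted in documents, the baseline is estimated by averaging over random samples, and the normalization is a plain ratio. The names `diffnum`, `baseline`, `rep_diffnum` and the toy corpus are hypothetical.

```python
import random
from statistics import mean


def diffnum(docs):
    """Characteristic value: number of different words in a document set."""
    return len({word for doc in docs for word in doc})


def baseline(corpus, size, trials=200, seed=0):
    """Estimate the characteristic value of a random document set.

    Assumption: the estimate is the average over `trials` random samples
    of `size` documents drawn from the corpus.
    """
    rng = random.Random(seed)
    return mean(diffnum(rng.sample(corpus, size)) for _ in range(trials))


def rep_diffnum(term, corpus, trials=200, seed=0):
    """Hypothetical Rep(term, DIFFNUM): observed value over the baseline.

    Assumption: the normalization is a simple ratio.
    """
    d_t = [doc for doc in corpus if term in doc]  # D(T)
    if not d_t:
        raise ValueError("term does not occur in the corpus")
    return diffnum(d_t) / baseline(corpus, len(d_t), trials, seed)


# Toy corpus: each document is a list of tokens.
corpus = [
    ["the", "parser", "builds", "a", "tree"],
    ["the", "parser", "reads", "a", "grammar"],
    ["the", "index", "stores", "a", "term"],
    ["the", "query", "matches", "a", "document"],
]

print(rep_diffnum("the", corpus))     # term that occurs in every document
print(rep_diffnum("parser", corpus))  # term confined to similar documents
```

In this sketch a term that occurs in every document scores exactly 1, because D(T) is then indistinguishable from a random document set of the same size. A term whose documents share more vocabulary than a random set does scores below 1.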
    <Paragraph position="1"> Experiments showed that the newly developed measures were particularly effective for discarding frequent but uninformative terms. We expect that these measures can be used for the automated construction of a stop-word list and for improving document-similarity calculation.</Paragraph>
    <Paragraph position="2"> An important finding was that the baseline function is portable; that is, one defined on a corpus can be used for normalization in a different corpus even if the two corpora have considerably different sizes or are in different domains. We can therefore apply the measures in a practical application that deals with multiple similar corpora when word distribution information is available for only one of them and not fully known for the others.</Paragraph>
    <Paragraph position="3"> We plan to apply Rep(., LLR) and Rep(., DIFFNUM) to several tasks in the IR domain, such as the construction of a stop-word list for indexing and term weighting in document-similarity calculation. It will also be interesting to theoretically estimate the baseline functions by using fundamental parameters such as the total number of words in a corpus or the total number of different words in the corpus. The nature of the baseline functions deserves further study.</Paragraph>
  </Section>
</Paper>