<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0403">
  <Title>Temporal Feature Modification for Retrospective Categorization</Title>
  <Section position="4" start_page="18" end_page="19" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> Text categorization (TC) is the problem of assigning documents to one or more pre-defined categories. As Section 1 demonstrated, the terms which best characterize a category can change through time, so it is not unreasonable to assume that intelligent use of temporal context will prove useful in TC.</Paragraph>
    <Paragraph position="1"> Consider sorting several decades of articles from the Los Angeles Times into the categories ENTERTAINMENT, BUSINESS, SPORTS, POLITICS, and WEATHER. Suppose we come across the term schwarzenegger in a training document. In the 1970s, during his career as a professional bodybuilder, Arnold Schwarzenegger's name would be a strong indicator of a SPORTS document. During his film career in the 1980s-1990s, his name would be most likely to appear in an ENTERTAINMENT document. After 2003, at the outset of his term as California's governor, the POLITICS and BUSINESS categories would be the most likely candidates. We refer to schwarzenegger as a temporally perturbed term, because its distribution across categories varies greatly with time.</Paragraph>
    <Paragraph position="2"> Documents containing temporally perturbed terms hold valuable information, but this is lost in a statistical analysis based purely on the average distribution of terms across categories, irrespective of temporal context. This information can be recovered with a technique we call temporal feature modification (TFM). We first outline a formal model for its use.</Paragraph>
    <Section position="1" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.1 A term generator framework
</SectionTitle>
      <Paragraph position="0"> One obvious way to introduce temporal information into the categorization task is to simply provide the year of publication as a new lexical feature. Preliminary experiments (not reported here) showed that this method had virtually no effect on classification performance. When the date features were &quot;emphasized&quot; with higher frequencies, classification performance declined.</Paragraph>
      <Paragraph position="1"> Instead, we proceed from the perspective of a simplified language generator model (e.g., Blei et al., 2003). We imagine that the first step in the production of a document involves an author choosing a category C. Each term k (word, bigram, phrase, etc.) is accorded a unique generator G_k that determines the distribution of k across categories, and therefore its likelihood to appear in category C. The model assumes that all authors share the same generator for each term, and that the generators do not change over time. We are particularly interested in identifying temporally perturbed lexical generators that violate this assumption.</Paragraph>
      <Paragraph position="2"> External events at time t can perturb the generator of k, causing Pr(C|k_t) to differ from the background Pr(C|k) computed over the entire corpus. If the perturbation is significant, we want to separate the instances of k at time t from all other instances.</Paragraph>
      <Paragraph position="3"> Returning to our earlier example, we would treat a generic, atemporal occurrence of schwarzenegger and the pseudo-term &quot;schwarzenegger+2003&quot; as though they were actually different terms, because they were produced by two different generators. We hypothesize that separating the analysis of the two can improve our estimates of the true Pr(C|k), both in 2003 and in other years.</Paragraph>
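      <Paragraph> The contrast between the background Pr(C|k) and the time-specific Pr(C|k_t) can be illustrated with a small Python sketch. The toy corpus, the function name category_distribution, and the use of raw document counts as probability estimates are assumptions made for illustration only; they are not taken from the experiments.

```python
from collections import Counter

# Hypothetical toy corpus: (year, category, set of terms in the document).
DOCS = [
    (1975, "SPORTS", {"schwarzenegger", "contest"}),
    (1976, "SPORTS", {"schwarzenegger", "title"}),
    (1988, "ENTERTAINMENT", {"schwarzenegger", "film"}),
    (1991, "ENTERTAINMENT", {"schwarzenegger", "sequel"}),
    (2003, "POLITICS", {"schwarzenegger", "recall"}),
    (2003, "POLITICS", {"schwarzenegger", "governor"}),
    (2003, "BUSINESS", {"budget", "governor"}),
]


def category_distribution(docs, term, year=None):
    """Estimate Pr(C | term) from document counts.

    If year is given, only documents from that year are counted,
    which gives the time-specific estimate Pr(C | term at year).
    """
    counts = Counter(
        category
        for doc_year, category, terms in docs
        if term in terms and (year is None or doc_year == year)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Background distribution over the whole corpus: Pr(C|k).
background = category_distribution(DOCS, "schwarzenegger")
# Distribution restricted to one year: Pr(C|k_t) with t = 2003.
in_2003 = category_distribution(DOCS, "schwarzenegger", 2003)
```

      In this toy corpus the background estimate spreads the term evenly over three categories, while the 2003 estimate concentrates on POLITICS, which is the kind of gap a perturbed generator would produce.</Paragraph>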
    </Section>
    <Section position="2" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
3.2 TFM Procedure
</SectionTitle>
      <Paragraph position="0"> The generator model motivates the procedure outlined below, which flags certain lexemes with explicit temporal information so that they are distinguished from instances produced by the underlying atemporal generators. The procedure uses the (log) odds ratio for feature selection:</Paragraph>
      <Paragraph position="1"> OddsRatio(k, C) = log( p(1 - q) / (q(1 - p)) )</Paragraph>
      <Paragraph position="2"> where p is Pr(k|C), the probability that term k is present given category C, and q is Pr(k|¬C), the probability that k is present given any other category.</Paragraph>
      <Paragraph position="3"> The odds ratio between a term and a category measures the strength of their association: it is high when the term occurs frequently within the category and (relatively) infrequently outside it. Odds ratio performs very well in feature selection tests; see (Mladenic, 1998) for details on its use and variations.</Paragraph>
      <Paragraph position="4"> Ultimately, the choice of odds ratio is arbitrary; it could be replaced by any method that measures term-category strength.</Paragraph>
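      <Paragraph> As an illustration, the measure can be computed from document counts. The sketch below takes the log odds ratio in its standard form, log( p(1 - q) / (q(1 - p)) ), with p estimating Pr(k|C) and q estimating the probability of k outside C. The function name, its argument names, and the additive smoothing constant alpha are assumptions of this sketch, not details reported for the experiments.

```python
import math


def log_odds_ratio(n_term_in_c, n_c, n_term_not_c, n_not_c, alpha=0.5):
    """Log odds ratio log(p(1-q) / (q(1-p))) from document counts.

    n_term_in_c  : documents in category C that contain the term
    n_c          : documents in category C
    n_term_not_c : documents outside C that contain the term
    n_not_c      : documents outside C

    The smoothing constant alpha is an assumption of this sketch; it
    keeps p and q strictly between 0 and 1 so the logarithm is defined.
    """
    p = (n_term_in_c + alpha) / (n_c + 2 * alpha)
    q = (n_term_not_c + alpha) / (n_not_c + 2 * alpha)
    return math.log(p * (1 - q)) - math.log(q * (1 - p))


# A term concentrated in C scores above zero; a term that is equally
# common inside and outside C scores exactly zero.
strong = log_odds_ratio(40, 100, 5, 400)
neutral = log_odds_ratio(10, 100, 10, 100)
```

      Swapping the roles of the category and its complement flips the sign of the score, so strongly negative values mark terms that are characteristic of the other categories.</Paragraph>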
      <Paragraph position="5"> The following pseudocode describes the process of temporal feature modification:</Paragraph>
      <Paragraph position="7"> for each term k in ModifyList(t): Add pseudo-term &quot;k+t&quot; to Vocab</Paragraph>
      <Paragraph position="9"> the odds ratio measure, are highly associated with category C at time t. (In our case, time is divided annually, because this is the finest resolution we have for many of the documents in our corpus.) We test the hypothesis that these come from a perturbed generator at time t, as opposed to the atemporal generator G_k, by comparing the odds ratios of term-category pairs in a PreModList at time t with the same pairs across the entire corpus. Terms that pass this test are added to the final ModifyList(t) for time t. For the results that we report, DecisionRule is a simple ratio test with threshold factor f. Suppose f is 2.0: if the odds ratio between C and k is twice as great at time t as it is atemporally, the decision rule is &quot;passed&quot;. The generator G_k is then considered perturbed at time t and k is added to ModifyList(t). In the training and testing phases, the documents are modified so that a term k is replaced with the pseudo-term &quot;k+t&quot; if it passed the ratio test.</Paragraph>
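      <Paragraph> The procedure described in the prose above can be sketched in Python. Several details here are assumptions rather than reported settings: each PreModList is taken to be the top_l highest-scoring terms for a category in a year, the ratio test compares plain (not log) odds ratios and passes on greater-or-equal, counts are smoothed with an additive constant alpha, and the names odds_ratio, build_modify_lists, apply_tfm, and the toy corpus are hypothetical.

```python
from collections import defaultdict


def odds_ratio(docs, term, category, alpha=0.5):
    """Smoothed odds ratio p(1-q) / (q(1-p)) of term vs. category over docs."""
    in_c = [terms for _, cat, terms in docs if cat == category]
    out_c = [terms for _, cat, terms in docs if cat != category]
    p = (sum(term in t for t in in_c) + alpha) / (len(in_c) + 2 * alpha)
    q = (sum(term in t for t in out_c) + alpha) / (len(out_c) + 2 * alpha)
    return (p * (1 - q)) / (q * (1 - p))


def build_modify_lists(docs, top_l=1, f=2.0):
    """Return {year: set of terms judged perturbed in that year}."""
    by_year = defaultdict(list)
    for doc in docs:
        by_year[doc[0]].append(doc)
    categories = sorted({cat for _, cat, _ in docs})
    modify = defaultdict(set)
    for year, year_docs in by_year.items():
        vocab = sorted({k for _, _, terms in year_docs for k in terms})
        for category in categories:
            # PreModList: the top_l terms most associated with the
            # category within this year (assumed cutoff).
            scored = sorted(
                ((odds_ratio(year_docs, k, category), k) for k in vocab),
                reverse=True,
            )
            for score_t, k in scored[:top_l]:
                # DecisionRule: ratio test against the atemporal odds
                # ratio computed over the entire corpus.
                if score_t >= f * odds_ratio(docs, k, category):
                    modify[year].add(k)
    return modify


def apply_tfm(terms, year, modify):
    """Replace each flagged term k with the pseudo-term 'k+year'."""
    flagged = modify.get(year, set())
    return [f"{k}+{year}" if k in flagged else k for k in terms]


# Hypothetical toy corpus: (year, category, set of terms).
DOCS = [
    (1990, "ENTERTAINMENT", {"schwarzenegger", "film"}),
    (1990, "ENTERTAINMENT", {"schwarzenegger", "film"}),
    (1990, "POLITICS", {"election", "vote"}),
    (1990, "POLITICS", {"election", "vote"}),
    (2003, "POLITICS", {"schwarzenegger", "election"}),
    (2003, "POLITICS", {"schwarzenegger", "vote"}),
    (2003, "ENTERTAINMENT", {"film"}),
    (2003, "ENTERTAINMENT", {"film"}),
]

modify_lists = build_modify_lists(DOCS, top_l=1, f=2.0)
modified = apply_tfm(["schwarzenegger", "vote"], 2003, modify_lists)
```

      In the toy corpus, schwarzenegger is evenly split between two categories overall but concentrated in POLITICS in 2003, so it passes the ratio test for that year, whereas film is strongly tied to ENTERTAINMENT in every year and is left unmodified.</Paragraph>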
    </Section>
    <Section position="3" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
3.3 Text categorization details
</SectionTitle>
      <Paragraph position="0"> The TC parameters held constant in our experiments are: a stoplist, Porter stemming, and Laplacian smoothing.</Paragraph>
      <Paragraph position="1"> Other parameters were varied: four different classifiers, three minimum vocabulary frequencies, unigrams and bigrams, and four threshold factors f. Ten-fold cross-validation was used for parameter selection, and 10% of the corpus was held out for testing purposes. Both of these sets were distributed evenly across time.</Paragraph>
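      <Paragraph> One possible reading of a held-out set that is distributed evenly across time is a split that samples the test documents within each year. The sketch below shows that reading only; the function name holdout_by_year, the fixed random seed, and the assumption that each document is a tuple whose first element is its year are all hypothetical.

```python
import random
from collections import defaultdict


def holdout_by_year(docs, test_fraction=0.1, seed=0):
    """Split docs into (train, test), drawing the test set within each year.

    Each document is assumed to be a tuple whose first element is the
    publication year. Sampling per year keeps the held-out set spread
    evenly across time.
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for doc in docs:
        by_year[doc[0]].append(doc)
    train, test = [], []
    for year in sorted(by_year):
        group = list(by_year[year])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical corpus: 20 documents in each of three years.
corpus = [(year, i) for year in (1990, 1991, 1992) for i in range(20)]
train_docs, test_docs = holdout_by_year(corpus)
```

      The same per-year grouping could be reused to build cross-validation folds, so that every fold also covers the full time span.</Paragraph>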
    </Section>
  </Section>
</Paper>