<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3240">
  <Title>Learning to Classify Email into &quot;Speech Acts&quot;</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 An Ontology of Email Acts
</SectionTitle>
    <Paragraph position="0"> Our ontology of nouns and verbs covering some of the possible speech acts associated with emails is summarized in Figure 2. We assume that a single email message may contain multiple acts, and that each act is described by a verb-noun pair drawn from this ontology (e.g., &amp;quot;deliver data&amp;quot;). The underlined nodes in the figure indicate the nouns and verbs for which we have trained classifiers (as discussed in subsequent sections).</Paragraph>
    <Paragraph position="1"> To define the noun and verb ontology of Figure 2, we first examined email from several corpora (including our own inboxes) to find regularities, and then performed a more detailed analysis of one corpus. The ontology was further refined in the process of labeling the corpora described below.</Paragraph>
    <Paragraph position="2"> In refining this ontology, we adopted several principles. First, we believe that it is more important for the ontology to reflect observed linguistic behavior than to reflect any abstract view of the space of possible speech acts. As a consequence, the taxonomy of verbs contains concepts that are atomic linguistically, but combine several illocutionary points. (For example, the linguistic unit &amp;quot;let's do lunch&amp;quot; is both directive, as it requests the receiver, and commissive, as it implicitly commits the sender. In our taxonomy this is a single 'propose' act.) Also, acts which are abstractly possible but not observed in our data are not represented (for instance, declarations).</Paragraph>
    <Paragraph position="3">  Second, we believe that the taxonomy must reflect common non-linguistic uses of email, such as the use of email as a mechanism to deliver files. We have grouped this with the linguistically similar speech act of delivering information.</Paragraph>
    <Paragraph position="4"> The verbs in Figure 1 are defined as follows.</Paragraph>
    <Paragraph position="5"> A request asks (or orders) the recipient to perform some activity. A question is also considered a request (for delivery of information).</Paragraph>
    <Paragraph position="6"> A propose message proposes a joint activity, i.e., asks the recipient to perform some activity and commits the sender as well, provided the recipient agrees to the request. A typical example is an email suggesting a joint meeting.</Paragraph>
    <Paragraph position="7"> An amend message amends an earlier proposal.</Paragraph>
    <Paragraph position="8"> Like a proposal, the message involves both a commitment and a request. However, while a proposal is associated with a new task, an amendment is a suggested modification of an already-proposed task.</Paragraph>
    <Paragraph position="9"> A commit message commits the sender to some future course of action, or confirms the senders' intent to comply with some previously described course of action.</Paragraph>
    <Paragraph position="10"> A deliver message delivers something, e.g., some information, a PowerPoint presentation, the URL of a website, the answer to a question, a message sent &amp;quot;FYI&amp;quot;, or an opinion.</Paragraph>
    <Paragraph position="11"> The refuse, greet, and remind verbs occurred very infrequently in our data, and hence we did not attempt to learn classifiers for them (in this initial study). The primary reason for restricting ourselves in this way was our expectation that human annotators would be slower and less reliable if given a more complex taxonomy.</Paragraph>
    <Paragraph position="12"> The nouns in Figure 2 constitute possible objects for the email speech act verbs. The nouns fall into two broad categories.</Paragraph>
    <Paragraph position="13"> Information nouns are associated with email speech acts described by the verbs Deliver, Remind and Amend, in which the email explicitly contains information. We also associate information nouns with the verb Request, where the email contains instead a description of the needed information (e.g., &amp;quot;Please send your birthdate.&amp;quot; versus &amp;quot;My birthdate is ...&amp;quot;. The request act is actually for a 'deliver information' activity). Information includes data believed to be fact as well as opinions, and also attached data files.</Paragraph>
    <Paragraph position="14"> Activity nouns are generally associated with email speech acts described by the verbs Propose, Request, Commit, and Refuse. Activities include meetings, as well as longer term activities such as committee memberships.</Paragraph>
    <Paragraph position="15"> Notice every email speech act is itself an activity. The &lt;verb&gt;&lt;noun&gt; node in Figure 1 indicates that any email speech act can also serve as the noun associated with some other email speech act. For example, just as (deliver information) is a legitimate speech act, so is (commit (deliver information)). Automatically constructing such nested speech acts is an interesting and difficult topic; however, in the current paper we consider only the problem of determining top-level the verb for such compositional speech acts. For instance, for a message containing a (commit (deliver information)) our goal would be to automatically detect the commit verb but not the inner (deliver information) compound noun.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="112" type="metho">
    <SectionTitle>
4 Categorization Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Corpora
</SectionTitle>
      <Paragraph position="0"> Although email is ubiquitous, large and realistic email corpora are rarely available for research purposes. The limited availability is largely due to privacy issues: for instance, in most US academic institutions, a users' email can only be distributed to researchers if all senders of the email also provided explicit written consent.</Paragraph>
      <Paragraph position="1"> The email corpora used in our experiments consist of four different email datasets collected from working groups who signed agreements to make their email accessible to researchers. The first three datasets, N01F3, N02F2, and N03F2 are annotated subsets of a larger corpus, the CSpace email corpus, which contains approximately 15,000 email messages collected from a management course at Carnegie Mellon University. In this course, 277 MBA students, organized in approximately 50 teams of four to six members, ran simulated companies in different market scenarios over a 14-week period (Kraut et al.).</Paragraph>
      <Paragraph position="2"> N02F2, N01F3 and N03F2 are collections of all email messages written by participants from three different teams, and contain 351, 341 and 443 different email messages respectively.</Paragraph>
      <Paragraph position="3"> The fourth dataset, the PW CALO corpus, was generated during a four-day exercise conducted at SRI specifically to generate an email corpus.</Paragraph>
      <Paragraph position="4"> During this time a group of six people assumed different work roles (e.g. project leader, finance manager, researcher, administrative assistant, etc) and performed a number of group activities.</Paragraph>
      <Paragraph position="5"> There are 222 email messages in this corpus.</Paragraph>
      <Paragraph position="6"> These email corpora are all task-related, and associated with a small working group, so it is not surprising that they contain many instances of the email acts described above--for instance, the CSpace corpora contain an average of about 1.3 email verbs per message. Informal analysis of other personal inboxes suggests that this sort of email is common for many university users. We believe that negotiation of shared tasks is a central use of email in many work environments.</Paragraph>
      <Paragraph position="7"> All messages were preprocessed by removing quoted material, attachments, and non-subject header information. This preprocessing was performed manually, but was limited to operations which can be reliably automated. The most difficult step is removal of quoted material, which we address elsewhere (Carvalho &amp; Cohen, 2004).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Inter-Annotator Agreement
</SectionTitle>
      <Paragraph position="0"> Each message may be annotated with several labels, as it may contain several speech acts. To evaluate inter-annotator agreement, we doublelabeled N03F2 for the verbs Deliver, Commit, Request, Amend, and Propose, and the noun, Meeting, and computed the kappa statistic (Carletta, 1996) for each of these, defined as</Paragraph>
      <Paragraph position="2"> where A is the empirical probability of agreement on a category, and R is the probability of agreement for two annotators that label documents at random (with the empirically observed frequency of each label). Hence kappa ranges from -1 to +1.</Paragraph>
      <Paragraph position="3"> The results in Table 1 show that agreement is good, but not perfect.</Paragraph>
      <Paragraph position="4">  We also took doubly-annotated messages which had only a single verb label and constructed the 5-class confusion matrix for the two annotators shown in Table 2. Note kappa values are somewhat higher for the shorter one-act messages. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="112" type="sub_section">
      <SectionTitle>
4.3 Learnability of Categories
</SectionTitle>
      <Paragraph position="0"> Representation of documents. To assess the types of message features that are most important for prediction, we adopted Support Vector Machines (Joachims, 2001) as our baseline learning method, and a TFIDF-weighted bag-of-words as a baseline representation for messages. We then conducted a series of experiments with the N03F2 corpus only to explore the effect of different representations.</Paragraph>
      <Paragraph position="1">  We noted that the most discriminating words for most of these categories were common words, not the low-to-intermediate frequency words that are most discriminative in topical classification. This suggested that the TFIDF weighting was inappropriate, but that a bigram representation might be more informative. Experiments showed that adding bigrams to an unweighted bag of words representation slightly improved performance, especially on Deliver. These results are shown in Table 4 on the rows marked &amp;quot;no tfidf&amp;quot; and &amp;quot;bigrams&amp;quot;. (The TFIDF-weighted SVM is shown in the row marked &amp;quot;baseline&amp;quot;, and the majority classifier in the row marked &amp;quot;default&amp;quot;; all numbers are F1 measures on 10-fold crossvalidation.) Examination of messages suggested other possible improvements. Since much negotiation involves timing, we ran a hand-coded extractor for time and date expressions on the data, and extracted as features the number of time expressions in a message, and the words that occurred near a time (for instance, one such feature is &amp;quot;the word 'before' appears near a time&amp;quot;). These results appear in the row marked &amp;quot;times&amp;quot;. Similarly, we ran a part of speech (POS) tagger and added features for words appearing near a pronoun or proper noun (&amp;quot;personPhrases&amp;quot; in the table), and also added POS counts.</Paragraph>
      <Paragraph position="2"> To derive a final representation for each category, we pooled all features that improved performance over &amp;quot;no tfidf&amp;quot; for that category. We then compared performance of these document representations to the original TFIDF bag of words baseline on the (unexamined) N02F2 and N01F3 corpora. As Table 3 shows, substantial improvement with respect to F1 and kappa was obtained by adding these additional features over the baseline representation. This result contrasts with previous experiments with bigrams for topical text classification (Scott &amp; Matwin, 1999) and sentiment detection (Pang et al., 2002). The difference is probably that in this task, more informative words are potentially ambiguous: for instance, &amp;quot;will you&amp;quot; and &amp;quot;I will&amp;quot; are correlated with requests and commitments, respectively, but the individual words in these bigrams are less predictive.</Paragraph>
      <Paragraph position="3"> Learning methods. In another experiment, we fixed the document representation to be unweighted word frequency counts and varied the learning algorithm. In these experiments, we pooled all the data from the four corpora, a total of 9602 features in the 1357 messages, and since the nouns and verbs are not mutually exclusive, we formulated the task as a set of several binary classification problems, one for each verb.</Paragraph>
      <Paragraph position="4"> The following learners were used from the Based on the MinorThird toolkit (Cohen, 2004).</Paragraph>
      <Paragraph position="5"> VP is an implementation of the voted perceptron algorithm (Freund &amp; Schapire, 1999). DT is a simple decision tree learning system, which learns trees of depth at most five, and chooses splits to maximize the function ( )  [?]+[?]+ + WWWW suggested by Schapire and Singer (1999) as an appropriate objective for &amp;quot;weak learners&amp;quot;. AB is an implementation of the confidence-rated boosting method described by Singer and Schapire (1999), used to boost the DT algorithm 10 times. SVM is a support vector machine with a linear kernel (as used above).</Paragraph>
      <Paragraph position="6">  verbs, using 5-fold cross-validation to assess accuracy. One surprise was that DT (which we had intended merely as a base learner for AB) works surprisingly well for several verbs, while AB seldom improves much over DT. We conjecture that the bias towards large-margin classifiers that is followed by SVM, AB, and VP (and which has been so successful in topic-oriented text classification) may be less appropriate for this task, perhaps because positive and negative classes are not clearly separated (as suggested by substantial inter-annotator disagreement).</Paragraph>
      <Paragraph position="7">  Further results are shown in Figure 3-5, which provide precision-recall curves for many of these classes. The lowest recall level in these graphs corresponds to the precision of random guessing.</Paragraph>
      <Paragraph position="8"> These graphs indicate that high-precision predictions can be made for the top-level of the verb hierarchy, as well as verbs Request and Deliver, if one is willing to slightly reduce recall.</Paragraph>
      <Paragraph position="9">  Transferability. One important question involves the generality of these classifiers: to what range of corpora can they be accurately applied? Is it possible to train a single set of email-act classifiers that work for many users, or is it necessary to train individual classifiers for each user? To explore this issue we trained a DT classifier for Directive emails on the NF01F3 corpus, and tested it on the NF02F2 corpus; trained the same classifier on NF02F2 and tested it on NF01F3; and also performed a 5-fold cross-validation experiment within each corpus.</Paragraph>
      <Paragraph position="10"> (NF02F2 and NF01F3 are for disjoint sets of users, but are approximately the same size.) We then performed the same experiment with VP for Deliver verbs and SVM for Commit verbs (in each case picking the top-performing learner with respect to F1). The results are shown in Table 5.</Paragraph>
      <Paragraph position="11">  If learned classifiers were highly specific to a particular set of users, one would expect that the diagonal entries of these tables (the ones based on cross-validation within a corpus) would exhibit much better performance than the off-diagonal entries. In fact, no such pattern is shown. For Directive verbs, performance is similar across all table entries, and for Deliver and Commit, it seems to be somewhat better to train on NF01F3 regardless of the test set.</Paragraph>
    </Section>
    <Section position="4" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
4.4 Future Directions
</SectionTitle>
      <Paragraph position="0"> None of the algorithms or representations discussed above take into account the context of an email message, which intuitively is important in detecting implicit speech acts. A plausible notion of context is simply the preceding message in an email thread.</Paragraph>
      <Paragraph position="1"> Exploiting this context is non-trivial for several reasons. Detecting threads is difficult; although email headers contain a &amp;quot;reply-to&amp;quot; field, users often use the &amp;quot;reply&amp;quot; mechanism to start what is intuitively a new thread. Also, since email is asynchronous, two or more users may reply simultaneously to a message, leading to a thread structure which is a tree, rather than a sequence. Finally, most sequential learning models assume a single category is assigned to each instance--e.g., (Ratnaparkhi, 1999)--whereas our scheme allows multiple categories.</Paragraph>
      <Paragraph position="2"> Classification of emails according to our verb-noun ontology constitutes a special case of a general family of learning problems we might call factored classification problems, as the classes (email speech acts) are factored into two features (verbs and nouns) which jointly determine this class. A variety of real-world text classification problems can be naturally expressed as factored problems, and from a theoretical viewpoint, the additional structure may allow construction of new, more effective algorithms.</Paragraph>
      <Paragraph position="3"> For example, the factored classes provide a more elaborate structure for generative probabilistic models, such as those assumed by Naive Bayes. For instance, in learning email acts, one might assume words were drawn from a mixture distribution with one mixture component produces words conditioned on the verb class factor, and a second mixture component generates words conditioned on the noun (see Blei et al (2003) for a related mixture model). Alternatively, models of the dependencies between the different factors (nouns and verbs) might also be used to improve classification accuracy, for instance by building into a classifier the knowledge that some nouns and verbs are incompatible.</Paragraph>
      <Paragraph position="4"> The fact that an email can contain multiple email speech acts almost certainly makes learning more difficult: in fact, disagreement between human annotators is generally higher for longer messages. This problem could be addressed by more detailed annotation: rather than annotating each message with all the acts it contains, human annotators could label smaller message segments (say, sentences or paragraphs). An alternative to more detailed (and expensive) annotation would be to use learning algorithms that implicitly segment a message. As an example, another mixture model formulation might be used, in which each mixture component corresponds to a single verb category.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>