<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0404">
  <Title>Using Semantic and Syntactic Graphs for Call Classification</Title>
  <Section position="4" start_page="26" end_page="28" type="metho">
    <SectionTitle>
4 Computation of the SSGs
</SectionTitle>
    <Paragraph position="0"> In this section, the tools used to compute the information in SSGs are described and their performances on manually transcribed spoken dialog utterances are presented. All of these components may be improved independently, for the speci c application domain.</Paragraph>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
4.1 Part-of-Speech Tagger
</SectionTitle>
      <Paragraph position="0"> Part-of-speech tagging has been very well studied in the literature for many languages, and the approaches vary from rule-based to HMM-based and classi er-based (Church, 1988; Brill, 1995, among others) tagging. In our framework, we employ a simple HMM-based tagger, where the most probable tag sequence, a29a30 , given the words, a31 , is out-</Paragraph>
      <Paragraph position="2"> Since we do not have enough data which is manually tagged with part-of-speech tags for our applications, we used Penn Treebank (Marcus et al., 1994) as our training set. Penn Treebank includes data from Wall Street Journal, Brown, ATIS, and Switchboard corpora. The nal two sets are the most useful for our domain, since they are also from spoken language and include dis uencies. As a test set, we manually labeled 2,000 words of user utterances from an AT&amp;T VoiceTone spoken dialog system application, and we achieved an accuracy of 94.95% on manually transcribed utterances. When we examined the errors, we have seen that the frequent word please is mis-labeled or frequently occurs as a verb in the training data, even when it is not. Given that the latest literature on POS tagging using Penn Treebank reports an accuracy of around 97% with in-domain</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.2 Syntactic Parser
</SectionTitle>
      <Paragraph position="0"> For syntactic parsing, we use the Collins' parser (Collins, 1999), which is reported to give over 88% labeled recall and precision on Wall Street Journal portion of the Penn Treebank.</Paragraph>
      <Paragraph position="1"> We use Buchholz's chunklink script to extract information from the parse trees3. Since we do not have any data from our domain, we do not have a performance gure for this task for our domain.</Paragraph>
    </Section>
    <Section position="3" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.3 Named Entity Extractor
</SectionTitle>
      <Paragraph position="0"> For named entity extraction, we tried using a simple HMM-based approach, a simpli ed version of BBN's name nder (Bikel et al., 1999), and a classi er-based tagger using Boostexter (Schapire and Singer, 2000). In the simple HMM-based approach, which is the same as the part-of-speech tagging, the goal is to nd the tag sequence, a29a30 , which</Paragraph>
      <Paragraph position="2"> a31a49a48 for the word sequence, a31 . The tags in this case are named entity categories (such as P and p for Person names, O and o for Organization names, etc. where upper-case indicates the rst word in the named entity) or NA if the word is not a part of a named entity. In the simpli ed version of BBN's name nder, the states of</Paragraph>
      <Paragraph position="4"> the model were word/tag combinations, where the tag a53a54a13 for word a55a56a13 is the named entity category of each word. Transition probabilities consisted of tri-gram probabilities</Paragraph>
      <Paragraph position="6"> over these combined tokens. In the nal version, we extended this model with an unknown words model (Hakkani-Tcurrency1ur et al., 1999). In the classi er-based approach, we used simple features such as the current word and surrounding 4 words, binary tags indicating if the word considered contains any digits or is formed from digits, and features checking capitalization (Carreras et al., 2003).</Paragraph>
      <Paragraph position="7"> To test these approaches, we have used data from an AT&amp;T VoiceTone spoken dialog system application for a pharmaceutical domain, where some of the named entity categories were person, organization, drug name, prescription number, and date. The training and test sets contained around 11,000 and 5,000 utterances, respectively. Table 1 summarizes the overall F-measure results as well as F-measure for the most frequent named entity categories. Overall, the classi er based approach resulted in the best performance, so it is also used for the call classi cation experiments.</Paragraph>
    </Section>
    <Section position="4" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.4 Semantic Role Labeling
</SectionTitle>
      <Paragraph position="0"> The goal of semantic role labeling is to extract all the constituents which ll a semantic role of a target verb. Typical semantic arguments include Agent, Patient, Instrument, etc. and also adjuncts such as Locative, Temporal, Manner, Cause, etc. In this  tion with various approaches. HMM is the simple HMM-based approach, IF is the simpli ed version of BBN's name nder with an unknown words model.</Paragraph>
      <Paragraph position="1"> work, we use the semantic roles and annotations from the PropBank corpus (Kingsbury et al., 2002), where the arguments are given mnemonic names, such as Arg0, Arg1, Arg-LOC, etc. For example, for the sentence I have bought myself a blue jacket from your summer catalog for twenty ve dollars last week, the agent (buyer, or Arg0) is I, the predicate is buy, the thing bought (Arg1) is a blue jacket, the seller or source (Arg2) is from your summer catalog, the price paid (Arg3) is twenty ve dollars, the benefactive (Arg4) is myself, and the date (ArgM-TMP) is last week4.</Paragraph>
      <Paragraph position="2"> Semantic role labeling can be viewed as a multi-class classi cation problem. Given a word (or phrase) and its features, the goal is to output the most probable semantic label. For semantic role labeling, we have used the exact same feature set that Hacioglu et al. (2004) have used, since their system performed the best among others in the CoNLL-2004 shared task (Carreras and M arquez, 2004). We have used Boostexter (Schapire and Singer, 2000) as the classi er. The features include token-level features (such as the current (head) word, its part-of-speech tag, base phrase type and position, etc.), predicate-level features (such as the predicate's lemma, frequency, part-of-speech tag, etc.) and argument-level features which capture the relationship between the token (head word/phrase) and the predicate (such as the syntactic path between the token and the predicate, their distance, token position relative to the predicate, etc.).</Paragraph>
      <Paragraph position="3"> In order to evaluate the performance of semantic role labeling, we have manually annotated 285 utterances from an AT&amp;T VoiceTone spoken dialog sys- null tem application for a retail domain. The utterances include 645 predicates (2.3 predicates/utterance).</Paragraph>
      <Paragraph position="4"> First we have computed recall and precision rates for evaluating the predicate identi cation performance.</Paragraph>
      <Paragraph position="5"> The precision is found to be 93.04% and recall is 91.16%. More than 90% of false alarms for predicate extraction are due to the word please, which is very frequent in customer care domain and erroneously tagged as explained above. Most of the false rejections are due to dis uencies and ungrammatical utterances. For example in the utterance I'd like to order place an order, the predicate place is tagged as a noun erroneously, probably because of the preceding verb order. Then we have evaluated the argument labeling performance. We have used a stricter measure than the CoNLL-2004 shared task. The labeling is correct if both the boundary and the role of all the arguments of a predicate are correct. In our test set, we have found out that our SRL tool correctly tags all arguments of 57.4% of the predicates.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="28" end_page="30" type="metho">
    <SectionTitle>
5 Call Classification Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="28" end_page="28" type="sub_section">
      <Paragraph position="0"> In order to evaluate our approach, we carried out call classi cation experiments using human-machine dialogs collected by the spoken dialog system used for customer care. We have only considered utterances which are responses to the greeting prompt How may I help you? in order not to deal with con rmation and clari cation utterances. We rst describe this data, and then give the results obtained by the semantic classi er. We have performed our tests using the Boostexter tool, an implementation of the Boosting algorithm, which iteratively selects the most discriminative features for a given task (Schapire and Singer, 2000).</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> Table 2 summarizes the characteristics of our application including the amount of training and test data, total number of call-types, average utterance length, and call-type perplexity. Perplexity is computed using the prior distribution over all the call-types in the training data.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="30" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> For call classi cation, we have generated SSGs for the training and test set utterances using the tools  described above. When a2 -grams are extracted from these SSGs, instead of the word graphs (Baseline), there is a huge increase in the number of features given to the classi er, as seen in Table 3. The classi er has now 15 times more features to work with. Although one can apply a feature selection approach before classi cation as frequently done in the machine learning community, we left the burden of analyzing 825,201 features to the classi er.</Paragraph>
      <Paragraph position="1"> Table 4 presents the percentage of the features selected by Boostexter using SSGs for each information category. As expected the lexical information is the most frequently used, and 54.06% of the selected features have at least one word in its a2 -gram. The total is more than 100%, since some features contain more than one category, as in the bigram feature example: POS:DT WORD:bill. This shows the use of other information sources as well as words.</Paragraph>
      <Paragraph position="2"> Table 5 presents our results for call classi cation.</Paragraph>
      <Paragraph position="3"> As the evaluation metric, we use the top class error rate (TCER), which is the ratio of utterances, where the top scoring call-type is not one of the true call-types assigned to each utterance by the human labelers. The baseline TCER on the test set using only word a2 -grams is 23.80%. When we extract features from the SSGs, we see a 2.14% relative decrease in the error rate down to 23.29%. When we analyze these results, we have seen that: a6 For easy to classify utterances, the classi er already assigns a high score to the true call-type  and SSGs.</Paragraph>
      <Paragraph position="4"> using just word a2 -grams.</Paragraph>
      <Paragraph position="5"> a6 The syntactic and semantic features extracted from the SSGs are not 100% accurate, as presented earlier. So, although many of these features have been useful, there is certain amount of noise introduced in the call classi cation training data.</Paragraph>
      <Paragraph position="6"> a6 The particular classi er we use, namely Boosting, is known to handle large feature spaces poorer than some others, such as SVMs. This is especially important with 15 times more features. null Due to this analysis, we have focused on a sub-set of utterances, namely utterances with low con dence scores, i.e. cases where the score given to the top scoring call-type by the baseline model is below a certain threshold. In this subset we had 333 utterances, which is about 17% of the test set. As expected the error rates are much higher than the overall and we get much larger improvement in performance when we use SSGs. The baseline for this set is 68.77%, and using extra features, this reduces to 62.16% which is a 9.61% relative reduction in the error rate.</Paragraph>
      <Paragraph position="7"> This nal experiment suggests a cascaded approach for exploiting SSGs for call classi cation.  That is, rst the baseline word a2 -gram based classi er is used to classify all the utterances, then if this model fails to commit on a call-type, we perform extra feature extraction using SSGs, and use the classi cation model trained with SSGs. This cascaded approach reduced the overall error rate of all utterances from 23.80% to 22.67%, which is 4.74% relative reduction in error rate.</Paragraph>
    </Section>
  </Section>
</Paper>