File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1041_metho.xml
Size: 21,455 bytes
Last Modified: 2025-10-06 14:08:02
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1041"> <Title>Information Extraction from Voicemail Transcripts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Voicemail Corpus </SectionTitle> <Paragraph position="0"> Development and evaluation were done using a proprietary corpus of almost 10,000 voicemail messages that had been manually transcribed and marked up for content. More details about this corpus can be found in (Bacchiani, 2001). The relevant content labeling is perhaps best illustrated with an (anonymized) excerpt from a typical message transcript: <greeting> hi Jane </greeting> <caller> this is Pat Caller </caller> I just wanted to I know you've probably seen this or maybe you already know about it . . . so if you could give me a call at <telno> one two three four five </telno> when you get the message I'd like to chat about it hope things are well with you <closing> talk to you soon </closing> This transcript is representative of a large class of messages that start out with a short greeting followed by a phrase that identifies the caller either by name as above or by other means ('hi, it's me'). A phone number may be mentioned as part of the caller's self-identification, or it may appear near the end of the message. It may seem natural and obvious that voicemail messages should be structured in this way, and this prototypical structure can therefore be exploited for purposes of locating caller information or deciding whether a digit string constitutes a phone number. The next sections discuss this in more detail.</Paragraph> <Paragraph position="1"> The corpus was partitioned into two subsets, with 8120 messages used for development and 1869 for evaluation. Approximately 5% of all messages are empty. Empty messages were not discarded from the evaluation set since they constitute realistic samples that the information extraction component has to cope with. The development set contains 7686 non-empty messages.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Caller Information </SectionTitle> <Paragraph position="0"> Of the non-empty messages in the development set, 7065 (92%) contain a marked-up caller phrase. Of those, 6731 messages mention a name in the caller phrase. Extracting caller information can be broken down into two slightly different tasks: we might want to reproduce the existing caller annotation as closely as possible, producing caller phrases like 'this is Pat Caller' or 'it's me'; or we might only be interested in caller names such as 'Pat Caller' in our above example. We make use of the fact that in the overwhelming majority of cases, the caller's self-identification occurs somewhere near the beginning of the message.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Caller Phrases </SectionTitle> <Paragraph position="0"> Most caller phrases tend to start one or two words into the message. This is because they are typically preceded by a one-word ('hi') or two-word ('hi Jane') greeting. Figure 1 shows the empirical distribution of the beginning of the caller phrase across the 7065 applicable transcripts in the development data.
As can be seen, more than 97% of all caller phrases start somewhere between one and seven words from the beginning of the message, though in one extreme case the start of the caller phrase occurred 135 words into the message.</Paragraph> <Paragraph position="1"> [Figure 1: Caller phrases starting x words into the message.] These observations strongly suggest that when extracting caller phrases, positional cues should be taken into account. This is good news, especially since intrinsic features of the caller phrase may not be as reliable: a caller phrase is likely to contain names that are problematic for an automatic speech recognizer. While this is less of a problem when evaluating on manual transcriptions, the experience reported in (Huang et al., 2001) suggests that the relatively high error rate of speech recognizers may negatively affect the performance of caller name extraction on automatically generated transcripts. For extracting caller phrases we therefore use positional information together with word-based features restricted to a small number of greetings and commonly occurring words like 'hi', 'this', 'is', etc., and a small number of common first names. We locate caller phrases by first identifying their start position in the message and then predicting the length of the phrase. The empirical distribution of caller phrase lengths in the development data is shown in Figure 2. Most caller phrases are between two and four words long ('it's Pat', 'this is Pat Caller') and there are moderately good lexical indicators that signal the end of a caller phrase ('I', 'could', 'please', etc.). Again, we avoid the use of names as features and rely on a small set of features based on common words, in addition to phrase length, for predicting the length of the caller phrase. [Figure 2: Caller phrases being x words long.] We have thus identified two classes of features that allow us to predict the start of the caller phrase relative to the beginning of the message, as well as the end of the caller phrase relative to its start. Since we are dealing with discrete word indices in both cases, we treat this as a classification task, rather than a regression task. A large number of classifier learners can be used to automatically infer classifiers for the two subtasks at hand. We chose a decision tree learner for convenience and note that this choice does not affect the overall results nearly as much as modifying our feature inventory.</Paragraph> <Paragraph position="2"> Since a direct comparison to the log-linear named entity tagger described in (Huang et al., 2001) (we refer to this approach as HZP log-linear below) is not possible due to the use of different corpora and annotation standards, we applied a similar named entity tagger based on a log-linear model with trigram features to our data (we refer to this approach as Col log-linear, as the tagger was provided by Michael Collins). Table 1 summarizes precision (P), recall (R), and F-measure (F) for three approaches evaluated on manual transcriptions: row HZP log-linear repeats the results of the best model from (Huang et al., 2001); row Col log-linear contains the results we obtained using a similar named entity tagger on our own data; and row JA classifiers shows the performance of the classifier method proposed in this section.</Paragraph> <Paragraph position="3"> Like Huang et al. (2001), we count a proposed caller phrase as correct if and only if it matches the annotation of the evaluation data perfectly.
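Before discussing the evaluation further, the two-step classification described above can be sketched as follows. The use of scikit-learn's DecisionTreeClassifier and the particular indicator features shown are illustrative assumptions rather than the exact feature inventory of our system.

# Illustrative sketch of the two-step extraction: one classifier predicts the
# start position of the caller phrase, a second predicts its length in words.
# The learner and the indicator features below are assumptions made for this
# example only.
from sklearn.tree import DecisionTreeClassifier

COMMON_WORDS = {"hi", "hello", "hey", "this", "is", "it's", "i", "could", "please"}

def start_features(words, max_prefix=8):
    # One indicator per early position: is the word a common greeting/cue word?
    prefix = [w.lower() for w in words[:max_prefix]]
    return [1 if i < len(prefix) and prefix[i] in COMMON_WORDS else 0
            for i in range(max_prefix)]

def length_features(words, start, max_len=8):
    # Indicators over the words following the predicted start position.
    window = [w.lower() for w in words[start:start + max_len]]
    return [1 if i < len(window) and window[i] in COMMON_WORDS else 0
            for i in range(max_len)]

start_clf = DecisionTreeClassifier()
length_clf = DecisionTreeClassifier()

def train(messages, spans):
    # messages: list of token lists; spans: list of gold (start, length) pairs.
    start_clf.fit([start_features(m) for m in messages], [s for s, _ in spans])
    length_clf.fit([length_features(m, s) for m, (s, _) in zip(messages, spans)],
                   [l for _, l in spans])

def extract_caller_phrase(words):
    start = int(start_clf.predict([start_features(words)])[0])
    length = int(length_clf.predict([length_features(words, start)])[0])
    return words[start:start + length]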
The numbers could be made to look better by using containment as the evaluation criterion, i.e., we would count a proposed phrase as correct if it contained an actual phrase plus perhaps some additional material.</Paragraph> <Paragraph position="4"> While this may be more useful in practice (see below), it is not the objective that was maximized during training, and so we prefer the stricter criterion for evaluation on previously annotated transcripts.</Paragraph> <Paragraph position="5"> [Table 1: Caller phrase extraction results (manual transcriptions).] While the results for the approach proposed here appear clearly worse than those reported by Huang et al. (2001), we hasten to point out that this is most likely not due to any difference in the corpora that were used. This is corroborated by the fact that we were able to obtain performance much closer to that of the best, finely tuned log-linear model from (Huang et al., 2001) by using a generic named entity tagger that was not adapted in any way to the particular task at hand. The log-linear taggers employ n-gram features based on family names and other particular aspects of the development data that do not necessarily generalize to other settings, where the family names of the callers may be different or may not be transcribed properly. In fact, it seems rather likely that the log-linear models and the features they employ over-fit the training data.</Paragraph> <Paragraph position="6"> This becomes clearer when one evaluates on unseen transcripts produced by an automatic speech recognizer (ASR), as summarized in Table 2. (An automatic transcription is the single best word hypothesis of the ASR for a given voicemail message.) Rows HZP strict and HZP containment repeat the figures for the best model from (Huang et al., 2001) when evaluated on automatic transcriptions. The difference is that HZP strict uses the strict evaluation criterion described above, whereas HZP containment uses the weaker criterion of containment, i.e., an extracted phrase counts as correct if it contains exactly one whole actual phrase. Row JA containment summarizes the performance of our approach when evaluated on 101 unseen automatically transcribed messages.</Paragraph> <Paragraph position="7"> Since we did not have any labeled automatic transcriptions available to compare with the predicted caller phrase labels using the strict criterion, we only report results based on the weaker criterion of containment. In fact, we count caller phrases as correct as long as they contain the full name of the caller, since this is the common denominator in the otherwise somewhat heterogeneous labeling of our training corpus; more on this issue in the next section.</Paragraph> <Paragraph position="8"> The difference between the approach in (Huang et al., 2001) and ours may be partly due to the performance of the ASR components: Huang et al. (2001) report a word error rate of 'about 35%', whereas we used a recognizer (Bacchiani, 2001) with a word error rate of only 23%. Still, the reduced performance of the HZP model on ASR transcripts compared with manual transcripts points toward overfitting, or reliance on features that do not generalize to ASR transcripts.
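For concreteness, the two evaluation criteria used above can be expressed as a minimal sketch; representing each phrase as a (start, end) token span and assuming at most one annotated phrase per message are simplifications made only for this illustration.

# Minimal sketch of the strict and containment evaluation criteria.
def strict_match(proposed, actual):
    # Correct only if the proposed span reproduces the annotation exactly.
    return proposed == actual

def containment_match(proposed, actual):
    # Correct if the proposed span contains the whole actual phrase,
    # possibly with additional surrounding material.
    return proposed[0] <= actual[0] and actual[1] <= proposed[1]

def precision_recall(proposed_spans, actual_spans, match):
    correct = sum(1 for p, a in zip(proposed_spans, actual_spans)
                  if p is not None and a is not None and match(p, a))
    n_proposed = sum(1 for p in proposed_spans if p is not None)
    n_actual = sum(1 for a in actual_spans if a is not None)
    return (correct / n_proposed if n_proposed else 0.0,
            correct / n_actual if n_actual else 0.0)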
Our main approach, on the other hand, uses classifiers that are extremely knowledge-poor: in contrast with the many features of the log-linear models underlying the various named entity taggers, they employ no more than a few dozen categorical features.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Caller Names </SectionTitle> <Paragraph position="0"> Extracting an entire caller phrase like 'this is Pat Caller' may not be all that relevant in practice: the prefix 'this is' does not provide much useful information, so simply extracting the name of the caller should suffice. The inclusion of such prefixes is largely an artifact of the annotation standard used for marking up voicemail transcripts. We decided to test the effects of changing that standard post hoc. This was relatively easy to do, since proper names are capitalized in the message transcripts. We heuristically identify caller names as the leftmost longest contiguous sub-sequence of capitalized words inside a marked-up caller phrase. This leaves us with 6731 messages with caller names in our development data. (Messages that do not mention a name as part of their caller phrase typically employ the caller phrase 'it's me', which would be easy to detect and treat separately.) As we did for caller phrases, we briefly examine the distributions of the start position of caller names (see Figure 3) as well as their lengths (see Figure 4).</Paragraph> <Paragraph position="1"> Comparing the entropies of the empirical distributions with the corresponding ones for caller phrases suggests that we might be dealing with a simpler extraction task here. The entropy of the empirical name length distribution is not much more than one bit, since predicting the length of a caller name is mostly a question of deciding whether a first name or full name was mentioned.</Paragraph> <Paragraph position="2"> [Figure 3: Caller names starting x words into the message.]</Paragraph> <Paragraph position="3"> The performance comparison in Table 3 shows that we are in fact dealing with a simpler task. Notice however that our method has not changed at all. We still use one classifier to predict the beginning of the caller name and a second classifier to predict its length, with the same small set of lexical features that do not include any names other than a handful of common first names.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Phone Numbers </SectionTitle> <Paragraph position="0"> The development data contain 5303 marked-up phone numbers, for an average of almost 0.7 phone numbers per non-empty message. These phone numbers fall into several categories based on their realization: about 97% are expressed straightforwardly in terms of spoken digits, numbers, and a few other words particular to the phone number domain, while the remaining cases involve corrections, hesitations, fragments, and questionable markup. Note that phone numbers in the North American Numbering Plan are either ten or seven digits long, depending on whether the Numbering Plan Area code is included or not. Two other frequent lengths for phone numbers in the development data are four (for internal lines) and, to a lesser extent, eleven (when the long distance dialing prefix is included, as in 'one eight hundred . . . ').</Paragraph> <Paragraph position="1"> This allows us to formulate the following baseline approach: find all maximal substrings consisting of spoken digits ('zero' through 'nine') and keep those of length four, seven, and ten. Simple as it may seem, this approach (which we call digits below) performs surprisingly well.
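A minimal sketch of this digits baseline follows; tokenization and case handling are simplified assumptions made for illustration.

# Sketch of the 'digits' baseline: collect maximal runs of spoken digits and
# keep those whose length is a plausible phone number length (4, 7, or 10).
SPOKEN_DIGITS = {"zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine"}
PLAUSIBLE_LENGTHS = {4, 7, 10}

def digits_baseline(words):
    candidates, run = [], []
    for w in words + [""]:          # empty sentinel flushes the final run
        if w.lower() in SPOKEN_DIGITS:
            run.append(w.lower())
        else:
            if len(run) in PLAUSIBLE_LENGTHS:
                candidates.append(run)
            run = []
    return candidates

# Example: the ten-digit run is kept, the stray trailing 'one' is not.
# digits_baseline("call me at nine zero eight five five five one two one two or extension one".split())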
Its precision is more than 78%, partly because our corpus contains few seven- or ten-digit numbers that are not phone numbers.</Paragraph> <Paragraph position="2"> Named entity taggers based on conditional models with trigram features are not particularly suited for this task. The reason is that trigrams do not provide enough history to allow the tagger to judge the length of a proposed phone number: it inserts beginning and end tags without being able to tell how far apart they are. Data sparseness is another problem, since we are dealing with 1000 distinct trigrams over digits alone. A different event model that replaces all spoken digits with the same representative token might therefore be better suited, also because it avoids over-fitting issues such as accidentally learning area codes and other number patterns that are frequent in the development data.</Paragraph> <Paragraph position="3"> However, there is a more serious problem. Even if the distance between the start and end tags that a named entity tagger predicts could be taken into account, this would not help with all spoken renditions of phone numbers. For example, '327-1025' could be read aloud using only six words ('three two seven ten twenty five'), and might be incorrectly rejected because it appears to be of a length that is not very common for phone numbers.</Paragraph> <Paragraph position="4"> We therefore approach the phone number extraction task differently, using a two-phase procedure. In the first phase we use a hand-crafted grammar to propose candidate phone numbers. This avoids all of the problems mentioned so far, yet the complexity of the task remains manageable because of the rather simple structure of most phone numbers in our development data noted above. The advantage is that it allows us to simultaneously convert spoken digits and numbers to a numeric representation, whose length can then be used as an important feature for deciding whether to keep or throw away a candidate. Note that such a conversion process is desirable in any case, since a text-based application would presumably want to present digit strings like '327-1025' to a user, rather than 'three two seven ten twenty five'. This conversion step is not entirely trivial, though: for example, one might transcribe the spoken words 'three hundred fourteen ninety nine' as either '300-1499' or '314.99', depending on whether they are preceded by, say, 'call me back at' or 'I can sell it to you for'. But since we are only interested in finding phone numbers, the extraction component can treat all candidates it proposes as if they were phone numbers.</Paragraph> <Paragraph position="5"> Adjustments to the hand-crafted grammar were made only in order to increase recall on the development data. The grammar should locate as many actual phone numbers in the development corpus as possible, but was free to also propose spurious candidates that did not correspond to marked-up phone numbers. While it has recently been argued that such separate optimization of recall and precision is generally desirable for certain learning tasks (Agarwal and Joshi, 2001; Joshi et al., 2001), the main advantage in connection with hand-crafted components is simplified development.
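Returning to the conversion step, the following sketch converts spoken digits and small numbers to a digit string under the phone-number reading (e.g. 'three two seven ten twenty five' becomes '3271025'). The word lists and the greedy grouping are simplifications shown for illustration; the actual hand-crafted grammar is more permissive.

# Sketch of the numeric conversion for the phone-number reading of spoken
# digits and numbers.  The handling of 'hundred' covers cases such as
# 'one eight hundred ...'; anything not recognized is simply skipped.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def spoken_to_digits(words):
    digits, i = "", 0
    while i < len(words):
        w = words[i].lower()
        nxt = words[i + 1].lower() if i + 1 < len(words) else ""
        if w in TEENS:
            digits += str(TEENS[w]); i += 1
        elif w in TENS:
            if nxt in UNITS and UNITS[nxt] != 0:   # 'twenty five' -> '25'
                digits += str(TENS[w] + UNITS[nxt]); i += 2
            else:                                  # bare 'twenty' -> '20'
                digits += str(TENS[w]); i += 1
        elif w in UNITS:
            if nxt == "hundred":                   # 'eight hundred' -> '800'
                digits += str(UNITS[w] * 100); i += 2
            else:                                  # single spoken digit
                digits += str(UNITS[w]); i += 1
        else:
            i += 1  # skip unrecognized words (hypothetical fillers like 'area', 'code')
    return digits

# spoken_to_digits("three two seven ten twenty five".split())  returns '3271025'
# spoken_to_digits("one eight hundred five five five one two one two".split())  returns '18005551212'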
Since we noted above that 97% of all phone numbers in our development data are expressed fairly straightforwardly in terms of digits, numbers, and a few other words particular to the phone number domain, we might expect to achieve recall figures close to 97% without doing anything special to deal with the remaining 3% of difficult cases. It was very easy to achieve this recall figure on the development data, while the ratio of proposed phone numbers to actual phone numbers was about 3.2 at worst. (It would of course be trivial to achieve 100% recall by extracting all possible substrings of a transcript; the fact that our grammar extracts only about three times as many phrases as needed is evidence that it falls within the reasonable subset of possible extraction procedures.) A second phase is now charged with the task of weeding through the set of candidates proposed during the first phase, retaining those that correspond to actual phone numbers. This is a simple binary classification task, and again many different techniques can be applied. As a baseline we use a classifier that accepts any candidate of length four or more (now measured in terms of numeric digits, rather than spoken words), and rejects candidates of length three or less. Without this simple step (which we refer to as prune below), the precision of our hand-crafted extraction grammar is only around 30%, but by pruning away candidate phone numbers shorter than four digits precision almost doubles while recall is unaffected.</Paragraph> <Paragraph position="6"> We again used a decision tree learner to automatically infer a classifier for the second phase. The features we made available to the learner were the length of the phone number in numeric digits, its distance from the end of the message, and a small number of lexical cues in the surrounding context of a candidate number ('call', 'number', etc.). This approach (which we call classify below) increases the precision of the combined two steps to acceptable levels without hurting recall too much.</Paragraph> <Paragraph position="8"> A comparison of performance results is presented in Table 4. Rows HZP rules and HZP log-linear refer to the rule-based baseline and the best log-linear model of (Huang et al., 2001), and the figures are simply taken from that paper; row Col log-linear refers to the same named entity tagger we used in the previous section and is included for comparison with the HZP models; row JA digits refers to the simple baseline where we extract strings of spoken digits of plausible lengths. Our main results appear in the remaining rows. The performance of our hand-crafted extraction grammar (in row JA extract) was about what we had seen on the development data before, with recall being as high as one could reasonably expect. As mentioned above, using a simple pruning step in the second phase (see JA extract + prune) results in a doubling of precision and leaves recall essentially unaffected (a single fragmentary phone number was wrongly excluded). Finally, if we use a decision tree classifier in the second phase, we can achieve extremely high precision with a minimal impact on recall. Our two-phase procedure outperforms all other methods we considered.</Paragraph> <Paragraph position="9"> [Table 4: Phone number extraction results (manual transcriptions).] We evaluated the performance of our best models on the same 101 unseen ASR transcripts used above in the evaluation of the caller phrase extraction.
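Before turning to those results, the second phase can be illustrated with a minimal sketch; the exact feature encoding and the use of scikit-learn's DecisionTreeClassifier are assumptions made for this illustration, and only the feature types follow the description above.

# Sketch of the second phase: the 'prune' baseline keeps candidates of four or
# more numeric digits, and the 'classify' step trains a decision tree over the
# candidate length, its distance from the end of the message, and lexical cues.
from sklearn.tree import DecisionTreeClassifier

LEXICAL_CUES = ("call", "number")   # cue words mentioned in the text

def prune(candidate_digits):
    return len(candidate_digits) >= 4

def candidate_features(candidate_digits, position, message_words, window=5):
    # position: token index of the candidate within the message transcript.
    context = [w.lower() for w in
               message_words[max(0, position - window):position + window]]
    cue_flags = [1 if cue in context else 0 for cue in LEXICAL_CUES]
    distance_from_end = len(message_words) - position
    return [len(candidate_digits), distance_from_end] + cue_flags

phone_clf = DecisionTreeClassifier()

def train_classify(candidates, labels):
    # candidates: (digit_string, token_position, message_words) triples;
    # labels: 1 if the candidate corresponds to a marked-up phone number.
    phone_clf.fit([candidate_features(d, p, m) for d, p, m in candidates], labels)

def keep(candidate):
    d, p, m = candidate
    return bool(phone_clf.predict([candidate_features(d, p, m)])[0])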
The results are summarized in Table 5, which also repeats the best results from (Huang et al., 2001), using the same terminology as earlier: rows HZP strict and HZP containment refer to the best model from (Huang et al., 2001) (corresponding to row HZP log-linear in Table 4) when evaluated using the strict criterion and containment, respectively; and row JA containment refers to our own best model (corresponding to row JA extract + classify in Table 4).</Paragraph> <Paragraph position="10"> It is not very plausible that the differences between the approaches in Table 5 would be due to a difference in the performance of the ASR components that generated the message transcripts. From inspecting our own data it is clear that ASR mistakes inside phone numbers are virtually absent, and we would expect the same to hold even for an automatic recognizer with an overall much higher word error rate. Also, for most phone numbers the labeling is uncontroversial, so we expect the corpora used by Huang et al. (2001) and ourselves to be extremely similar in terms of mark-up of phone numbers. So the observed performance difference is most likely due to the difference in extraction methods.</Paragraph> </Section> </Paper>