<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1039"> <Title>Information Extraction From Voicemail</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Database </SectionTitle> <Paragraph position="0"> Our work focuses on a database of voicemail messages gathered at IBM, and made publicly available through the LDC. This database and related speech recognition work is described fully by (Huang et al., 2000). We worked with approximately a0a2a1a4a3a5a3a5a3 messages, which we divided into a6a2a1a8a7a9a3a5a3 messages for training, a0a10a3a5a3 for development test set, and a11 a3a5a3 for evaluation test set. The messages were manually transcribed 2, and then a human tagger identified the portions of each message that specified the caller and any return numbers that were left. In this work, we take a broad view of what constitutes a caller or number. The caller was defined to be the consecutive sequence of words that best answered the question &quot;who called?&quot;. The definition of a number we used is a sequence of consecutive words that enables a return call to be placed. Thus, for example, a caller might be &quot;Angela from P.C. Labs,&quot; or &quot;Peggy Cole Reed Balla's secretary&quot;. Similarly, a number may not be a digit string, for example: &quot;tieline eight oh five six,&quot; or &quot;pager one three five&quot;. No more than one caller was identified for a single message, though there could be multiple numbers. The training of the maximum entropy model and statistical transducer are done on these annotated scripts.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Baseline Rule-Based System </SectionTitle> <Paragraph position="0"> In voicemail messages, people often identify themselves and give their phone numbers in highly stereotyped ways. So for example, someone might say, &quot;Hi Joe it's Harry...&quot; or &quot;Give me a call back at extension one one eight four.&quot; Our baseline system takes advantage of this fact by enumerating a set of transduction rules - in the form of a flex program - that transduce out the key information in a call.</Paragraph> <Paragraph position="1"> The baseline system is built around the notion of &quot;trigger phrases&quot;. These hand-crafted phases are patterns that are used in the flex program to recognize caller's identity and phone numbers.</Paragraph> <Paragraph position="2"> 2The manual transcription has a a12a14a13 word error rate Examples of trigger phrases are &quot;Hi this is&quot;, and &quot;Give me a call back at&quot;. In order to identify names and phone numbers as generally as possible, our baseline system has defined classes for person-names and numbers.</Paragraph> <Paragraph position="3"> In addition to trigger phrases, &quot;trigger suffixes&quot; proved to be useful for identifying phone numbers. For example, the phrase &quot;thanks bye&quot; frequently occurs immediately after the caller's phone number. In general, a random sequence of digits cannot be labeled as a phone number; but, a sequence of digits followed by &quot;thanks bye&quot; is almost certainly the caller's phone number. So when the flex program matches a sequence of digits, it stores it; then it tries to match a trigger suffix. If this is successful, the digit string is recognized a phone number string. Otherwise the digit string is ignored.</Paragraph> <Paragraph position="4"> Our baseline system has about 200 rules. 
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Maximum Entropy Model </SectionTitle> <Paragraph position="0"> Maximum entropy modeling is a powerful framework for constructing statistical models from data. It has been used in a variety of difficult classification tasks such as part-of-speech tagging (Ratnaparkhi, 1996), prepositional phrase attachment (Ratnaparkhi et al., 1994) and named entity tagging (Borthwick et al., 1998), and achieves state-of-the-art performance. In the following, we briefly describe the application of these models to extracting caller information from voicemail messages.</Paragraph> <Paragraph position="1"> The problem of extracting the information pertaining to the caller's identity and phone number can be thought of as a tagging problem, where the tags are "caller's identity," "caller's phone number" and "other." The objective is to assign each word in a message to one of these categories.</Paragraph> <Paragraph position="2"> The information that can be used to predict a word's tag is the identity of the surrounding words and their associated tags. Let H denote the set of possible word and tag contexts, called "histories", and let T denote the set of tags. The maxent model is then defined over H × T, and predicts the conditional probability p(t|h) for a tag t given the history h. The computation of this probability depends on a set of binary-valued "features" f_i(h, t).</Paragraph> <Paragraph position="3"> Given some training data and a set of features, the maximum entropy estimation procedure computes a weight parameter \alpha_i for every feature f_i and parameterizes p(t|h) as follows: p(t|h) = \frac{1}{Z(h)} \prod_i \alpha_i^{f_i(h,t)}, where Z(h) is a normalization constant.</Paragraph> <Paragraph position="4"> The role of the features is to identify characteristics of the histories that are strong predictors of specific tags; for example, the tag "caller" is very often preceded by the word sequence "this is". If a feature is a very strong predictor of a particular tag, then the corresponding \alpha_i will be large. It is also possible that a particular feature is a strong predictor of the absence of a particular tag, in which case the associated \alpha_i will be near zero.</Paragraph> <Paragraph position="5"> Training a maximum entropy model involves the selection of the features and the subsequent estimation of the weight parameters \alpha_i. The testing procedure involves a search that enumerates the candidate tag sequences for a message and chooses the one with the highest probability. We use the "beam search" technique of (Ratnaparkhi, 1996) to search the space of all hypotheses.</Paragraph>
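To make the parameterization concrete, the short sketch below computes p(t|h) for one history from a pair of binary features, using the product form given above. The feature functions and weights are toy values invented for illustration; they are not features or weights estimated from the voicemail data.

    import math

    # Toy illustration of p(t|h) = (1/Z(h)) * prod_i alpha_i^{f_i(h,t)}.
    # The two features and their weights are invented for the example.
    TAGS = ["caller", "phone_number", "other"]

    def f_this_is_caller(h, t):       # fires when "this is" precedes the word
        return 1 if t == "caller" and h["prev_bigram"] == ("this", "is") else 0

    def f_digit_number(h, t):         # fires when the current word is a digit word
        return 1 if t == "phone_number" and h["word"] in {"oh", "one", "two"} else 0

    FEATURES = [(f_this_is_caller, 8.0), (f_digit_number, 5.0)]   # (f_i, alpha_i)

    def p_tag_given_history(h):
        scores = {t: math.prod(alpha ** f(h, t) for f, alpha in FEATURES)
                  for t in TAGS}
        z = sum(scores.values())       # normalization constant Z(h)
        return {t: s / z for t, s in scores.items()}

    print(p_tag_given_history({"word": "harry", "prev_bigram": ("this", "is")}))

A weight near zero drives the score of the associated tag toward zero whenever that feature fires, which is the "predictor of absence" behavior noted above.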
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> Designing effective features is crucial to the maxent model. In the following sections, we describe the various feature functions that we experimented with. We first preprocess the text in the following ways: (1) map rare words (those whose count falls below a small threshold) to the symbol "UNKNOWN"; (2) map words in a name dictionary to the symbol "NAME." The first step is a way to handle out-of-vocabulary words in the test data; the second step takes advantage of known names. This mapping lets the model focus on learning features that help to predict the location of the caller's identity, leaving the recovery of the actual names to a later extraction step.</Paragraph> <Paragraph position="1"> 4.1.1 Unigram lexical features To compute unigram lexical features, we use the two words on either side of the current word, together with the tags of the previous two words, to define the history h_i as h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}. The features are generated by scanning each pair (h_i, t_i) in the training data with the feature template in Table 1. Note that although the window is two words on either side, the features are defined in terms of the value of a single word.</Paragraph> <Paragraph position="2"> 4.1.2 Bigram lexical features The trigger phrases used in the rule-based approach generally consist of several words, and turn out to be good predictors of the tags. In order to incorporate this information in the maximum entropy framework, we decided to use ngrams that occur in the surrounding word context to generate features. Due to data sparsity and computational cost, we restricted ourselves to using only bigrams. The bigram feature template is shown in Table 2.</Paragraph> <Paragraph position="3"> 4.1.3 Dictionary features First, a number dictionary is used to scan the training data and generate a code for each word representing "number" or "other". Second, a multi-word dictionary is used to match known pre-caller trigger prefixes and after-phone-number trigger suffixes. The same code, either "pre-caller" or "after-phone-number", is assigned to each word in the matched string. The combined stream of codes is added to the history h_i and used to generate features in the same way that the word sequence is used to generate the lexical features.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Feature selection </SectionTitle> <Paragraph position="0"> In general, the feature templates define a very large number of features, and some method is needed to select only the most important ones. A simple way of doing this is to discard the features that are rarely seen in the data. Discarding all features that occur fewer than a small number of times in the training data left a feature set on the order of ten thousand features. We also experimented with a more sophisticated incremental scheme. This procedure starts with no features and a uniform distribution p(t|h), and sequentially adds the features that most increase the data likelihood. The procedure stops when the gain in likelihood on a cross-validation set becomes small.</Paragraph> </Section> </Section>
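As a concrete sketch of how the lexical feature templates expand into actual features, the fragment below generates unigram features over a window of two words on either side plus the two previous tags, and then applies a simple count cutoff in the spirit of Section 4.2. The template names and the string encoding of each feature are invented for illustration and do not reproduce the exact templates of Table 1.

    from collections import Counter

    # Sketch of unigram lexical feature generation over a +/-2 word window,
    # plus the two previous tags; the feature-name strings are invented here.
    def history(words, tags, i, pad="*BOUNDARY*"):
        w = lambda k: words[k] if 0 <= k < len(words) else pad
        t = lambda k: tags[k] if 0 <= k < len(tags) else pad
        return {"w0": w(i), "w+1": w(i + 1), "w+2": w(i + 2),
                "w-1": w(i - 1), "w-2": w(i - 2),
                "t-1": t(i - 1), "t-2": t(i - 2)}

    def unigram_features(words, tags):
        """Yield (feature, tag) pairs for every position in a tagged message."""
        for i, tag in enumerate(tags):
            for name, value in history(words, tags, i).items():
                yield f"{name}={value}", tag

    words = "hi joe it's harry from p. c. labs".split()
    tags = ["other"] * 3 + ["caller"] * 5
    counts = Counter(unigram_features(words, tags))
    # Count cutoff: keep only features seen at least min_count times.
    min_count = 2
    kept = {feat for feat, c in counts.items() if c >= min_count}
    print(sorted(kept)[:5])

In a real training run the counts would be accumulated over the whole annotated corpus before the cutoff (or the incremental likelihood-based selection) is applied.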
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Transducer Induction </SectionTitle> <Paragraph position="0"> Our baseline system is essentially a hand-specified transducer, and in this section we describe how such a transducer can be induced automatically from labeled training data. The overall goal is to take a set of labeled training examples in which the caller and number information has been tagged, and to learn a transducer such that when voicemail messages are used as input, the transducer emits only the information-bearing words.</Paragraph> <Paragraph position="1"> First we present a brief description of how an automaton structure for voicemail messages can be learned from examples, and then we describe how to convert this structure into an appropriate transducer. Finally, we extend this process so that the training procedure acts hierarchically on different portions of the messages at different times. In contrast to the baseline flex system, the transducers that we induce are nondeterministic; when multiple alignments are possible, the lowest-cost transduction is preferred, with the costs determined by the transition probabilities encountered along the paths.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Inducing Finite State Automata </SectionTitle> <Paragraph position="0"> Many techniques have been developed for inducing finite state automata from word sequences, e.g. (Oncina and Vidal, 1993; Stolcke and Omohundro, 1994; Ron et al., 1998), and we chose to adapt the technique of (Ron et al., 1998). This is a simple method for inducing acyclic automata, and is attractive because of its simplicity and theoretical guarantees. Here we present only an abbreviated description of our implementation, and refer the reader to (Ron et al., 1998) for a full description of the original algorithm. In (Appelt and Martin, 1999), finite state transducers were also used for named entity extraction, but they were hand-specified.</Paragraph> <Paragraph position="1"> The basic idea of the structure induction algorithm is to start with a prefix tree, whose arcs are labeled with words and which exactly represents all the word sequences in the training data, and then to gradually transform it, by merging internal states, into a directed acyclic graph that represents a generalization of the training data. An example of a merge operation is shown in Figure 1.</Paragraph> <Paragraph position="2"> The decision to merge two nodes is based on the fact that a set of strings is rooted in each node of the tree, specified by the paths to all the reachable leaf nodes. A merge of two nodes is permissible when the corresponding sets of strings are statistically indistinguishable from one another. The precise definition of statistical similarity can be found in (Ron et al., 1998), and amounts to deeming two nodes indistinguishable unless one of them has a frequently occurring suffix that is rarely seen in the other. The exact order in which we merge nodes is a variant of the process described in (Ron et al., 1998).3 The transition probabilities are determined by aligning the training data to the induced automaton and counting the number of times each arc is used.</Paragraph> <Paragraph position="3"> 3 A frontier of nodes is maintained, and is initialized to the children of the root. The weight of a node is defined as the number of strings rooted in it. At each step, the heaviest node is removed, and an attempt is made to merge it with another frontier node, in order of decreasing weight. If a merge is possible, the result is placed on the frontier; otherwise, the heaviest node's children are added.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Conversion to a Transducer </SectionTitle> <Paragraph position="0"> Once a structure is induced for the training data, it can be converted into an information-extracting transducer in a straightforward manner. When the automaton is learned, we keep track of which words were found in information-bearing portions of the call and which were not. The structure of the transducer is identical to that of the automaton, but each arc performs a transduction: if the arc is labeled with a word that was information-bearing in the training data, then the word itself is transduced out; otherwise, an epsilon (the empty symbol) is transduced.</Paragraph> </Section>
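The conversion in Section 5.2 amounts to attaching an output label to every arc of the induced automaton. The fragment below shows one way this could be written; the Arc representation and the set of information-bearing words are hypothetical and only sketch the idea, not the system's actual data structures.

    from dataclasses import dataclass

    EPSILON = "<eps>"          # empty output symbol

    @dataclass
    class Arc:
        src: int
        dst: int
        word: str              # input label learned with the automaton
        output: str = EPSILON  # output label added during conversion
        prob: float = 1.0

    def to_transducer(arcs, informative_words):
        """Copy the automaton's arcs, emitting the word itself on arcs whose
        label was information-bearing in the training data, epsilon otherwise."""
        return [Arc(a.src, a.dst, a.word,
                    a.word if a.word in informative_words else EPSILON,
                    a.prob)
                for a in arcs]

    # Hypothetical example: "hi joe it's harry", where only the caller is tagged.
    automaton = [Arc(0, 1, "hi"), Arc(1, 2, "joe"),
                 Arc(2, 3, "it's"), Arc(3, 4, "harry")]
    transducer = to_transducer(automaton, informative_words={"harry"})
    print([(a.word, a.output) for a in transducer])

During decoding, the lowest-cost path through such arcs, with costs taken from the transition probabilities, determines which words are actually emitted.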
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Hierarchical Structure Induction </SectionTitle> <Paragraph position="0"> Conceptually, it is possible to induce a structure for voicemail messages in one step, using the algorithm described in the previous sections. In practice, we have found that this is a very difficult problem, and that it is expedient to break it into a number of simpler sub-problems. This has led us to develop a three-step induction process in which only short segments of text are processed at once.</Paragraph> <Paragraph position="1"> First, all the examples of phone numbers are gathered together, and a structure is induced. Similarly, all the examples of callers' identities are collected, and a structure is induced for them. To further simplify the task, we replaced number strings by the single symbol "NUMBER+" and person names by the symbol "PERSON-NAME". The transition costs for these structures are estimated by aligning the training data and counting the number of times the different transitions out of each state are taken. A phone number structure induced in this way from a subset of the data is shown at the top of Figure 2, together with the "segment" structure in which it is embedded (bottom); for clarity, transition probabilities are not displayed in the figure.</Paragraph> <Paragraph position="2"> In the second step, occurrences of names and numbers are replaced by single symbols, and the segments of text immediately surrounding them are extracted. This results in a database of examples like "Hi PERSON-NAME it's CALLER-STRUCTURE I wanted to ask you", or "call me at NUMBER-STRUCTURE thanks bye". In this example, the three words immediately preceding and following the number or caller are used. Using this database, a structure is induced for these segments of text, and the result is essentially an induced automaton that represents the trigger phrases that were manually identified in the baseline system. A small second-level structure is shown at the bottom of Figure 2.</Paragraph> <Paragraph position="3"> In the third step, the structure of a background language model is induced. The structures discovered in these three steps are then combined into a single large automaton that allows any sequence of caller, number, and background segments. For the system used in our experiments, we used a unigram language model as the background. When information-bearing patterns exist in the input, it is desirable for paths through the non-background portions of the final automaton to have a lower cost, and this is most likely with a high-perplexity background model.</Paragraph> </Section> </Section> </Paper>