<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0619"> <Title>Word Informativeness and Automatic Pitch Accent Modeling</Title> <Section position="5" start_page="148" end_page="148" type="metho"> <SectionTitle> 3. &quot;HOOVER dam.&quot; 4. &quot;Hoover TOWER.&quot; </SectionTitle> <Paragraph position="0"> While researchers have discussed the possible influence of semantic informativeness, there has been no known empirical study of the claim, nor has this type of information been incorporated into computational models of prosody. In this work, we employ two measurements of informativeness. First, we adopt an information-based framework (Shannon, 1948), quantifying the &quot;Information Content&quot; (IC) of a word as the negative log likelihood of the word in a corpus. The second measurement is TF*IDF (Term Frequency times Inverse Document Frequency) (Salton, 1989; Salton, 1991), which has been widely used to quantify word importance in information retrieval tasks. Both IC and TF*IDF are well-established measurements of informativeness and are therefore good candidates to investigate. Our empirical study shows that word informativeness not only is closely related to word accentuation, but also provides new power in pitch accent prediction. Our results suggest that information content is a valuable feature to be incorporated in speech synthesis systems.</Paragraph> <Paragraph position="1"> In the following sections, we first define IC and TF*IDF. Then, a description of the corpus used in this study is provided. We then describe a set of experiments conducted to study the relation between informativeness and pitch accent. We explain how machine learning techniques are used in the pitch accent modeling process. Our results show that: * Both IC and TF*IDF scores are strongly correlated with pitch accent assignment. 
* IC is a more powerful predictor than TF*IDF.</Paragraph> <Paragraph position="2"> * IC provides better prediction power in pitch accent prediction than previous techniques.</Paragraph> <Paragraph position="3"> The investigated pitch accent models can be easily adopted by speech synthesis systems.</Paragraph> </Section> <Section position="6" start_page="148" end_page="149" type="metho"> <SectionTitle> 2 Definitions of IC and TF*IDF </SectionTitle> <Paragraph position="0"> Following the standard definition in information theory (Shannon, 1948; Fano, 1961; Cover and Thomas, 1991), the IC of a word w is defined as:</Paragraph> <Paragraph position="2"> IC(w) = -log P(w), where P(w) is the probability of the word w appearing in a corpus, and P(w) is estimated as P(w) = F(w) / N, where F(w) is the frequency of w in the corpus and N is the total number of word occurrences in the corpus. Intuitively, if the probability of a word increases, its informativeness decreases and it is therefore less likely to be an information focus. Similarly, it is therefore less likely to be communicated with pitch prominence.</Paragraph> <Paragraph position="3"> TF*IDF is defined by two components multiplied together. TF (Term Frequency) is the word frequency within a document; IDF (Inverse Document Frequency) is the logarithm of the ratio of the total number of documents to the number of documents containing the word. The product TF*IDF is higher if a word has a high frequency within the document, which signifies high importance for the current document, and low dispersion in the corpus, which signifies high specificity. In this research, we employed a variant of the TF*IDF score used in SMART (Buckley, 1985), a popular information retrieval package:</Paragraph> <Paragraph position="5"> TF*IDF(w_i, d_j) = F_{w_i,d_j} * log(N / N_{w_i}) / sqrt(sum_{k=1}^{M} [F_{w_k,d_j} * log(N / N_{w_k})]^2), where F_{w_i,d_j} is the frequency of word w_i in document d_j, N is the total number of documents, N_{w_i} is the number of documents containing word w_i and M is the number of distinct stemmed words in document d_j. 
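The two scores above can be computed directly from corpus counts. The following is a minimal sketch in Python, not the SMART implementation: it assumes raw frequency counts, natural logarithms, and a cosine-normalized TF*IDF; the function names are our own.

```python
import math
from collections import Counter

def ic_scores(corpus_tokens):
    """IC(w) = -log P(w), with P(w) = F(w)/N estimated over the whole corpus."""
    freq = Counter(corpus_tokens)
    n = sum(freq.values())
    return {w: -math.log(f / n) for w, f in freq.items()}

def tfidf_scores(documents):
    """Per-document, cosine-normalized TF*IDF (a common SMART-style variant)."""
    n_docs = len(documents)
    df = Counter()                      # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)               # term frequency within this document
        raw = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores.append({w: v / norm for w, v in raw.items()})
    return scores
```

Note that a word occurring in every document gets IDF = log(1) = 0 and hence a TF*IDF score of 0 regardless of its within-document frequency, which matches the observation that domain-common words rank low in all documents.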
IC and TF*IDF capture different kinds of informativeness. IC is a metric global to the domain of a corpus, and each word in the corpus has a unique IC score. TF*IDF captures the balance of a metric local to a given document (TF) and a metric global to the corpus (IDF). Therefore, the TF*IDF score of a word changes from one document to another (different TF). However, some global features are also captured by TF*IDF. For example, a word common in the domain tends to get a low TF*IDF score in all the documents in the corpus.</Paragraph> </Section> <Section position="7" start_page="149" end_page="149" type="metho"> <SectionTitle> 3 Corpus Description </SectionTitle> <Paragraph position="0"> In order to empirically study the relation between word informativeness and pitch accent, we use a medical corpus which includes a speech portion and a text portion. The speech corpus includes fourteen segments totaling about 30 minutes of speech. The speech was collected at Columbia Presbyterian Medical Center (CPMC), where doctors informed residents or nurses about the postoperative status of a patient who had just undergone bypass surgery. The speech corpus was transcribed orthographically by a medical professional and is also intonationally labeled with pitch accents by a ToBI (Tone and Break Index) (Silverman et al., 1992; Beckman and Hirschberg, 1994) expert. The text corpus includes 1.24 million words in 2,422 discharge summaries, spanning a larger group of patients. The majority of these patients have also undergone cardiac surgery. The orthographic transcripts as well as the text corpus are used to calculate the IC and TF*IDF scores. First, all the words in the text corpus as well as the speech transcripts are processed by a stemming model so that words like &quot;receive&quot; and &quot;receives&quot; are treated as one word. We employ a revised version of Lovins' stemming algorithm (Lovins, 1968) which is implemented in SMART. 
Although the usefulness of stemming is arguable, we choose to use stemming because we think &quot;receive&quot; and &quot;receives&quot; are equally likely to be accented. Then, IC and TF*IDF are calculated. After this, the effectiveness of informativeness in accent placement is verified using the speech corpus. Each word in the speech corpus has an IC score, a TF*IDF score, a part-of-speech (POS) tag and a pitch accent label. Both IC and TF*IDF are used to test the correlation between informativeness and accentuation. POS is also investigated by several machine learning techniques in automatic pitch accent modeling.</Paragraph> </Section> <Section position="8" start_page="149" end_page="154" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We conducted a series of experiments to determine whether there is a correlation between informativeness and pitch accent and whether informativeness provides an improvement over other known indicators of pitch accent, such as part-of-speech. We experimented with different forms of machine learning to integrate indicators within a single framework, testing whether rule induction or hidden Markov modeling provides a better model.</Paragraph> <Section position="1" start_page="149" end_page="150" type="sub_section"> <SectionTitle> 4.1 Ranking Word Informativeness in the Corpus </SectionTitle> <Paragraph position="0"> Tables 1 and 2 show the most and least informative words in the corpus. The IC order indicates the rank among all the words in the corpus, while the TF*IDF order in the table indicates the rank among the words within a document. The document was picked randomly from the corpus. In general, most of the least informative words are function words, such as &quot;with&quot; or &quot;and&quot;. However, some content words are selected, such as &quot;patient&quot;, &quot;year&quot;, &quot;old&quot;. 
These content words are very common in this domain and are mentioned in almost all the documents in the corpus. In contrast, the majority of the most informative words are content words. Some of the selections are less expected. For example, &quot;your&quot; ranks as the most informative word in a document using TF*IDF; it appears only once in the entire corpus. This indicates that listeners or readers are rarely addressed in the corpus.</Paragraph> </Section> <Section position="2" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 4.2 Testing the Correlation of Informativeness and Accent Prediction </SectionTitle> <Paragraph position="0"> In order to verify whether word informativeness is correlated with pitch accent, we employ Spearman's rank correlation coefficient ρ and its associated test (Conover, 1980) to estimate the correlations between IC and pitch prominence as well as TF*IDF and pitch prominence. As shown in Table 3, both IC and TF*IDF are closely correlated with pitch accent, with significance levels p = 2.67·10^-85 and p = 2.90·10^-14 respectively.</Paragraph> <Paragraph position="1"> Because the correlation coefficient ρ is positive, the higher the IC and TF*IDF are, the more likely a word is to be accented.</Paragraph> </Section> <Section position="3" start_page="150" end_page="152" type="sub_section"> <SectionTitle> 4.3 Learning IC and TF*IDF Accent Models </SectionTitle> <Paragraph position="0"> The correlation test suggests that there is a strong connection between informativeness and pitch accent. But we also want to show how much performance gain can be achieved by adding this information to pitch accent models. To study the effect of TF*IDF and IC on pitch accent, we use machine learning techniques to learn models that predict the effect of these indicators on pitch accent. 
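The rank-correlation computation used in the test above can be sketched in a few lines of standard-library Python; `rankdata` and `spearman_rho` are illustrative helpers (in practice a library routine such as `scipy.stats.spearmanr` would be used, and the scores below are made up).

```python
from statistics import mean

def rankdata(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical word-level data: informativeness score per word, and a binary
# accent label (1 = accented, 0 = not accented).
ic = [5.1, 0.7, 4.3, 0.9, 6.2, 1.1, 3.8, 0.5]
accented = [1, 0, 1, 0, 1, 0, 1, 0]
rho = spearman_rho(ic, accented)
```

A positive ρ, as in this toy data, corresponds to the paper's finding: higher informativeness scores go with accented words.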
We use both RIPPER (Cohen, 1995) and Hidden Markov Models (HMM) (Rabiner and Juang, 1986) to build pitch accent models.</Paragraph> <Paragraph position="1"> RIPPER is a system that learns sets of classification rules from training data. It automatically selects rules which maximize information gain and employs heuristics to decide when to stop in order to prevent over-fitting.</Paragraph> <Paragraph position="2"> The performance of RIPPER is comparable with most benchmark rule induction systems, such as C4.5 (Quinlan, 1993). We train RIPPER on the speech corpus using 10-fold cross-validation, a standard procedure for training and testing when the amount of data is limited. In this experiment, the predictors are IC or TF*IDF, and the response variable is the pitch accent assignment. Once a set of RIPPER rules is acquired, the rules can be used to predict which words should be accented in a new corpus.</Paragraph> <Paragraph position="3"> The HMM is a probabilistic model which has been successfully used in many applications, such as speech recognition (Rabiner, 1989) and part-of-speech tagging (Kupiec, 1992).</Paragraph> <Paragraph position="4"> An HMM is defined as a triple λ = (A, B, Π), where A is a state transition probability matrix, B is an observation probability distribution matrix, and Π is an initial state distribution vector. In this experiment, the hidden states are the accent statuses of words, which can be either &quot;accented&quot; or &quot;not accented&quot;. The observations are the IC or TF*IDF scores of the words. Because of the limited size of the speech corpus, we use a first-order HMM, where the following condition is assumed:</Paragraph> <Paragraph position="6"> P(Q_t | Q_1, Q_2, ..., Q_{t-1}) = P(Q_t | Q_{t-1}), where Q_t is the state at time t. Because we employ a supervised training process, no sophisticated parameter estimation procedure, such as the Baum-Welch algorithm (Rabiner, 1989), is necessary. 
Here all the parameters are calculated directly from the labeled data as relative frequencies:</Paragraph> <Paragraph position="8"> a_ij = Count(Q_{t-1} = i, Q_t = j) / Count(Q_{t-1} = i), b_i(k) = Count(Q_t = i, O_t = v_k) / Count(Q_t = i), Π_i = Count(Q_1 = i) / (number of sequences), for 1 ≤ i, j ≤ N and 1 ≤ k ≤ M, where O_t is the observation at time t, v_k is the k-th observation symbol, N is the number of hidden states and M is the number of observations.</Paragraph> <Paragraph position="9"> Once all the parameters of the HMM are set, we employ the Viterbi algorithm (Viterbi, 1967; Forney, 1973) to find an optimal accentuation sequence, one which maximizes the probability of the observed IC or TF*IDF sequence given the HMM.</Paragraph> <Paragraph position="10"> Both RIPPER and HMM are widely accepted machine learning systems. However, their theoretical bases are very different.</Paragraph> <Paragraph position="11"> The HMM focuses on optimizing a sequence of accent assignments instead of isolated accent assignments. By employing both of them, we want to show that our conclusions hold for both approaches. Furthermore, we expect the HMM to do better than RIPPER because the influence of context words is incorporated. We use a baseline model where all words are assigned the default accent status (accented). 52% of the words in the corpus are actually accented and thus the baseline has a performance of 52%. Our results show that when TF*IDF is used to predict pitch accent, performance is increased over the baseline of 52% to 67.25% and 65.66% for HMM and RIPPER respectively. In the IC model, the performance is further increased to 71.96% and 70.06%.</Paragraph> <Paragraph position="12"> These results are obtained using 10-fold cross-validation. We can draw two conclusions from the results. First, both IC and TF*IDF are very effective in pitch accent prediction. All the improvements over the baseline model are statistically significant with p < 1.11·10^-16 (S reports p = 0 because of underflow; the real p-value is less than 1.11·10^-16, the smallest value the computer can represent in this case), using the χ² test (Fienberg, 1983; Fleiss, 1981). Second, the IC model is more powerful than the TF*IDF model. It outperforms the TF*IDF model with p = 3.8·10^-5 for the HMM model and p = 0.0002 for the RIPPER model. 
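The supervised HMM training and Viterbi decoding described in this section can be sketched as follows. This is an illustrative simplification: the two states and the discretized observation symbols (&quot;hi&quot;/&quot;lo&quot;) stand in for binned informativeness scores, and all names are our own.

```python
import math
from collections import defaultdict

def train_hmm(sequences):
    """Relative-frequency estimates for a first-order HMM from labeled
    (observation, state) sequences -- supervised, so no Baum-Welch is needed."""
    trans, emit, start = defaultdict(float), defaultdict(float), defaultdict(float)
    for seq in sequences:
        states = [s for _, s in seq]
        start[states[0]] += 1
        for obs, s in seq:
            emit[(s, obs)] += 1             # Count(Q_t = s, O_t = obs)
        for prev, cur in zip(states, states[1:]):
            trans[(prev, cur)] += 1         # Count(Q_{t-1} = prev, Q_t = cur)
    def normalize(counts, key_of):
        totals = defaultdict(float)
        for k, v in counts.items():
            totals[key_of(k)] += v
        return {k: v / totals[key_of(k)] for k, v in counts.items()}
    return (normalize(trans, lambda k: k[0]),
            normalize(emit, lambda k: k[0]),
            {s: c / sum(start.values()) for s, c in start.items()})

def viterbi(obs, states, trans, emit, start, floor=1e-9):
    """Most likely state sequence for obs, in log space to avoid underflow."""
    lp = lambda p: math.log(p) if p > 0 else math.log(floor)
    best = {s: lp(start.get(s, 0)) + lp(emit.get((s, obs[0]), 0)) for s in states}
    back = []
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p, arg = max((prev[r] + lp(trans.get((r, s), 0)), r) for r in states)
            best[s] = p + lp(emit.get((s, o), 0))
            ptr[s] = arg
        back.append(ptr)
    state = max(best, key=best.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

For example, training on sequences like `[("hi", "A"), ("lo", "U"), ("hi", "A")]` (A = accented, U = unaccented) and decoding `["hi", "lo", "hi"]` recovers the alternating accent pattern.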
The low p-values show that the improvements achieved by the IC models are significant. Since IC performs better than TF*IDF in pitch accent prediction, we choose IC to measure informativeness in all the following experiments. Another observation is that the HMM models do show some improvement over the RIPPER models, but the difference is marginal. More data is needed to test the significance of the improvement.</Paragraph> </Section> <Section position="4" start_page="152" end_page="154" type="sub_section"> <SectionTitle> 4.4 Incorporating IC in Reference Accent Models </SectionTitle> <Paragraph position="0"> In order to show that IC provides additional power in predicting pitch accent beyond current models, we need to directly compare the influence of IC with that of other reference models. In this section, we describe experiments that compare IC alone against a part-of-speech (POS) model for pitch accent prediction and then compare a model that integrates IC with POS against the POS model. Finally, anticipating the possibility that other features within a traditional TTS in combination with POS may provide equal or better performance than the addition of IC, we carried out experiments that directly compare the performance of a Text-to-Speech (TTS) synthesizer alone with a model that integrates TTS with IC.</Paragraph> <Paragraph position="1"> In most speech synthesis systems, part-of-speech (POS) is the most powerful feature in pitch accent prediction. Therefore, showing that IC provides additional power over POS is important. In addition to the importance of POS within TTS for predicting pitch accent, there is a clear overlap between POS and IC. We have shown that the words with the highest IC are usually content words and the words with the lowest IC are frequently function words. 
This is an added incentive for comparing IC with POS models. Thus, we want to explore whether the new information added by IC can provide any improvement when both are used to predict accent assignment.</Paragraph> <Paragraph position="2"> In order to create a POS model, we first utilize MXPOST, a maximum entropy part-of-speech tagger (Ratnaparkhi, 1996), to get the POS information for each word. The performance of the MXPOST tagger is comparable with most benchmark POS taggers, such as Brill's tagger (Brill, 1994). After this, we map all the part-of-speech tags into seven categories: &quot;noun&quot;, &quot;verb&quot;, &quot;adjective&quot;, &quot;adverb&quot;, &quot;number&quot;, &quot;pronoun&quot; and &quot;others&quot;. The mapping procedure is conducted because keeping all the initial tags (about 35) would drastically increase the amount of training data required.</Paragraph> <Paragraph position="3"> The obtained POS tag is the predictor in the POS model. As shown in Table 5, the performance of the two POS models is 71.33% and 70.52% for HMM and RIPPER respectively, which is comparable with that of the IC model. This comparison further shows the strength of IC, because it has power similar to that of POS in pitch accent prediction and it is very easy to compute. When the POS models are augmented with IC, the POS+IC model performance increases to 74.06% and 73.71% respectively. The improvement is statistically significant with p = 0.015 for the HMM model and p = 0.005 for RIPPER, which means the new information captured by IC provides additional predictive power for the POS+IC models. These experiments produce new evidence confirming that IC is a valuable feature in pitch accent modeling.</Paragraph> <Paragraph position="4"> We also tried another reference model, Text-to-Speech (TTS) synthesizer output, to evaluate the results. The TTS pitch accent model is more comprehensive than the POS model. 
It takes many features into consideration, such as discourse and semantic information. It is well established and has been evaluated in various situations. In this research, we adopted Bell Laboratories' TTS system (Sproat, 1997; Olive and Liberman, 1985; Hirschberg, 1990). We first ran it on our corpus to get the TTS pitch accent assignments. Comparing the TTS accent assignment with the expert accent assignment, the TTS performance is 71.75%, which is statistically significantly lower than that of the HMM POS+IC model, with p = 0.039. We also tried to incorporate IC in the TTS model.</Paragraph> <Paragraph position="5"> A simple way of doing this is to use the TTS output and IC as predictors and train them with our data. The obtained TTS+IC models achieve marginal improvement: the performance increases to 72.30% and 72.75% for HMM and RIPPER respectively, which is lower than that of the POS+IC models. We speculate that this may be due to the corpus we used. The Bell Laboratories' TTS pitch accent model was trained on a totally different domain, and our medical corpus seems to negatively affect the TTS performance (71.75% compared to around 80%, its normal performance). Since the TTS+IC models involve two totally different domains, the effectiveness of IC may be compromised. If this assumption holds, we think that the TTS+IC model would perform better if IC were trained together with the TTS internal features on our corpus directly. But since this requires retraining a TTS system for a new domain, and it is very hard for us to conduct such an experiment, no further comparison was conducted to verify this assumption.</Paragraph> <Paragraph position="6"> Although TF*IDF is less powerful than IC in pitch accent prediction, since they measure two different kinds of informativeness, it is possible that a TF*IDF+IC model can perform better than the IC model. 
Similarly, if TF*IDF is incorporated in the POS+IC model, the overall performance may increase for the POS+IC+TF*IDF model. However, our experiments show no improvement when TF*IDF is incorporated in the IC and POS+IC models; IC is always the dominant predictor when both IC and TF*IDF are present.</Paragraph> </Section> </Section> <Section position="9" start_page="154" end_page="154" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> Information-based approaches have been applied in natural language applications before. In (Resnik, 1993; Resnik, 1995), IC was used to measure semantic similarity between words and was shown to be more effective than traditional measurements of semantic distance within the WordNet hierarchy. A similar log-based, information-like measurement was also employed in (Leacock and Chodorow, 1998) to measure semantic similarity. TF*IDF scores are mainly used in keyword-based information retrieval tasks. For example, TF*IDF has been used in (Salton, 1989; Salton, 1991) to index the words in a document and is also implemented in SMART (Buckley, 1985), a general-purpose information retrieval package providing basic tools and libraries to facilitate information retrieval tasks.</Paragraph> <Paragraph position="1"> Some early work on pitch accent prediction in speech synthesis uses only the distinction between content words and function words. Although this approach is simple, it tends to assign more pitch accents than necessary. We also tried the content/function word model on our corpus and, as expected, found it to be less powerful than the part-of-speech model. More advanced pitch accent models make use of other information, such as part-of-speech, given/new distinctions and contrast information (Hirschberg, 1993). Semantic information is also employed in predicting accent patterns for complex nominal phrases (Sproat, 1994). 
Other comprehensive pitch accent models have been suggested in (Pan and McKeown, 1998) in the framework of Concept-to-Speech generation, where the output of a natural language generation system is used to predict pitch accent.</Paragraph> </Section> <Section position="10" start_page="154" end_page="154" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Since IC is not a perfect measurement of informativeness, it can cause problems in accent prediction. Moreover, even if a perfect measurement of informativeness were available, more features might be needed in order to build a satisfactory pitch accent model. In this section, we discuss each of these issues.</Paragraph> <Paragraph position="1"> IC does not directly measure the informativeness of a word; it measures the rarity of a word in a corpus. That a word is rare doesn't necessarily mean that it is informative. Semantically empty words can be ranked high using IC as well. For example, CABG is a common operation in this domain, and &quot;CABG&quot; is almost always used whenever the operation is mentioned. However, in a few instances, it is referred to as a &quot;CABG operation&quot;. As a result, the semantically empty word (in this context) &quot;operation&quot; gets a high IC score. It is very hard to distinguish high IC scores resulting from this situation from those that accurately measure informativeness, and this causes problems in precisely measuring the IC of a word. Similarly, misspelled words can also have high IC scores due to their rarity.</Paragraph> <Paragraph position="2"> Although IC is not ideal for quantifying word informativeness, even with a perfect measurement of informativeness there are still many cases where this information by itself would not be enough. For example, each word gets a unique IC score regardless of its context; yet it is well known that context information, such as given/new status and contrast, plays an important role in accentuation. 
In the future, we plan to build a comprehensive accent model with more pitch accent indicators, such as syntactic, semantic and discourse features.</Paragraph> </Section> </Paper>