<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1081"> <Title>Data-driven Classification of Linguistic Styles in Spoken Dialogues</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Method and Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Parameter values </SectionTitle> <Paragraph position="0"> For each turn a set of parameter values was computed.</Paragraph> <Paragraph position="1"> The STTS tagset consists of more than 50 different tags.</Paragraph> <Paragraph position="2"> In order to obtain reasonable results the STTS tagset was collapsed to a set with 12 classes. Their frequency distributions (henceforth Cxxx, e.g. CART for the frequency of articles) indicate the differences between the corpora (Figure 1). While the TABA corpus mainly contains nouns, prepositions, and particles, the two human-human corpora have many pronouns (pronominal referencing is possible due to longer contexts), verbs (varying tasks need task names, sentences in longer utterances are less likely to be elliptic), and adverbs (an utterance is put in relation to its context). The TABA corpus features nearly no interjections, while the number of numerals in the CH corpus is rather low (in the other corpora times and dates were explicit elements of the tasks). The length distributions (Lxxx) are similar with the exception of numerals in the VM corpus (dates are quite long in German, e.g. &quot;zweiundzwanzigster&quot; (22nd)). 
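The per-turn parameter extraction described above can be sketched as follows. The tag-to-class mapping shown is illustrative only (the paper does not list its exact 12 collapsed classes), and `turn_parameters` is a hypothetical helper name:

```python
from collections import Counter

# Illustrative mapping from a few STTS tags to collapsed classes;
# the actual 12-class inventory used in the paper is not spelled out here.
STTS_TO_CLASS = {
    "NN": "NOUN", "NE": "NOUN",
    "ART": "ART",
    "APPR": "PREP", "APPRART": "PREP",
    "PPER": "PRON", "PPOSAT": "PRON",
    "VVFIN": "VERB", "VAFIN": "VERB", "VVINF": "VERB",
    "ADV": "ADV",
    "ITJ": "ITJ",
    "CARD": "NUM",
}

def turn_parameters(tagged_turn):
    """Relative class frequencies (Cxxx) and average word lengths (Lxxx)
    for one turn, given as a list of (word, stts_tag) pairs."""
    classes = [STTS_TO_CLASS.get(tag, "OTHER") for _, tag in tagged_turn]
    n = len(classes)
    freq = {c: k / n for c, k in Counter(classes).items()}
    lengths = {}
    for (word, _), c in zip(tagged_turn, classes):
        lengths.setdefault(c, []).append(len(word))
    avg_len = {c: sum(v) / len(v) for c, v in lengths.items()}
    return freq, avg_len

# Invented example turn (note the long German date numeral):
turn = [("der", "ART"), ("Termin", "NN"), ("am", "APPRART"),
        ("zweiundzwanzigsten", "CARD"), ("passt", "VVFIN")]
freq, avg_len = turn_parameters(turn)
```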
An additional set of parameters is the relative frequency of the different classes in phrase-final position (Fxxx) (Klarner, 1997).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Compute speaker values </SectionTitle> <Paragraph position="0"> Apart from frequency and average length of words in a class, several other parameters were computed and averaged for every speaker (Figure 2). Important differences exist in the length of the turns, and also in the number of words per sentence. An average VM turn has more than 6 times as many words as an average TABA turn.</Paragraph> <Paragraph position="1"> While the length of a phrase is fairly similar between VM and CH, the number of phrases (and, thus, words) per sentence is higher for VM. Neither casual nor formal addressing is present in the TABA corpus (talking to a machine), while the VM setting (negotiation of business appointments) evokes formal speech. The CH dialogues are mostly between family members and close friends (Linguistic Data Consortium, 1997), and casual forms of address are frequent. Variations in the number of common words can be related to the list of common words used, which was based on the VM corpus. The larger number of different words per speaker in the TABA corpus results from fewer words per speaker, with less chance for repetition. Average word length and density are similar in all three corpora.</Paragraph> <Paragraph position="2"> The correlation coefficients between parameters were computed for the three corpora. Those parameters that correlated well with another parameter (correlation coefficient ≥ 0.6) were omitted from the pertinent corpus. The correlation coefficients between WIP and WIS are strong for CH and moderate for VM, while those between PIS and WIS are moderate for CH and strong for VM. This indicates that longer sentences have longer phrases in the CH corpus but more phrases in the VM corpus. 
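The correlation-based pruning of parameters can be sketched as below. The paper does not say which member of a correlated pair was omitted; this sketch assumes a greedy scheme that keeps the earlier-listed parameter, and the per-speaker values are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists.
    (Constant-valued parameters would need a zero-variance guard.)"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prune_correlated(params, threshold=0.6):
    """Greedily drop every parameter that correlates (|r| >= threshold)
    with a parameter kept earlier; `params` maps name -> per-speaker values."""
    kept = []
    for name in params:
        if all(abs(pearson(params[name], params[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Invented values: WIP tracks WIS almost perfectly, PIS does not.
params = {"WIS": [5, 7, 9, 11], "WIP": [5.1, 7.2, 8.9, 11.3], "PIS": [2, 1, 4, 1]}
kept = prune_correlated(params)
```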
Different annotation styles and guidelines may have caused this phenomenon.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Principal component analysis </SectionTitle> <Paragraph position="0"> To normalize for the different ranges of the speaker-specific parameters, the z scores (subtraction of the mean, division by the standard deviation) were computed as input for the principal component analysis (PCA). The PCA was done with singular value decomposition on the data matrix, which is the preferred method for numerical accuracy (Mardia et al., 1979). [Figure 2: the speaker-specific parameters (abbreviation and meaning) and their values for VM, CH, and TABA, e.g. SIT, sentences per turn (sentence end is marked by a question mark or colon in the VM and CH corpora; in the TABA corpus no sentence boundaries are labeled). Displayed are the values for the first quartile, the median, and the third quartile for each corpus.]</Paragraph> <Paragraph position="1"> The PCA results were used to assess the importance of the parameters. The important parameters (those that achieve high loads on the most important principal components) should not change much depending on the input set. Changes between the different corpora are likely, given their structural differences, but ideally a stable set of parameters emerges that contains the parameters important for all corpora.</Paragraph> <Paragraph position="2"> The input partition was varied by changing the minimal number of words per speaker in order to check for the stability of the components and for the influence of the interaction length. Figure 3 displays some results. It can be seen that the important parameters for a corpus do not change much if the input set is varied. The principal components also show strong similarities within a corpus.</Paragraph> <Paragraph position="3"> An interpretation of the components is always to be treated with caution. 
However, to ease further discussion, a tentative interpretation is given in Figure 4 for some of the principal components shown in Figure 3, where the interpretative labels can be motivated as follows: adverbs: An adverb can take the position of a prepositional phrase or verb phrase if an appropriate entity is present in the previous discourse context. [Figure 3: the most important principal components (PC) with varying input for the three corpora.]</Paragraph> <Paragraph position="4"> Thus, the number of adverbs/adjectives loads inversely to the number of prepositions and nouns (VM-1 PC 1, VM-3 PC 4, CH-1 PC 4, CH-2 PC 3, CH-3 PC 2).</Paragraph> <Paragraph position="5"> pronouns: A pronoun or a noun can refer to a discourse entity if that entity satisfies certain conditions. The number of pronouns loads inversely to the number of articles and nouns (VM-1 PC 4, CH-1 PC 2, CH-2 PC 2) and adverbs/adjectives (VM-2 PC 4, CH-3 PC 2). Pronominalization is only possible if the referred entity is mentioned in the very recent discourse context, ideally in the same turn; thus, longer turns favor pronominalization (VM-3 PC 3).</Paragraph> <Paragraph position="6"> ellipses: Ellipses are incomplete sentences in which redundant information is omitted (often verbs; VM-2 PC 2, VM-3 PC 2). Final articles, adverbs, and adjectives can indicate elliptic utterances (CH-1 PC 3, CH-2 PC 4, CH-3 PC 4).</Paragraph> <Paragraph position="7"> turn complexity: Turns with pronouns and verbs are long and very likely contain a complete sentence (TABA-1 PC 1, TABA-2 PC 1, TABA-3 PC 1).</Paragraph> <Paragraph position="8"> content words: The ratio of content words (less common words) is high (TABA-1 PC 2, TABA-2 PC 2). [Figure 4: tentative interpretation of the principal components displayed in Figure 3.]</Paragraph> <Paragraph position="9"> interjections: Interjections are rare in the TABA corpus. 
An interjection may, therefore, distinguish speakers (with interjections) from others (without interjections) (TABA-1 PC 4, TABA-2 PC 4, TABA-3 PC 3).</Paragraph> <Paragraph position="10"> While some of these components are corpus-specific (e.g. interjections), others are important for all corpora (e.g. word length or ellipses / turn complexity) or, at least, for the syntactically complex human-human corpora (e.g. pronouns and adverbs).</Paragraph> <Paragraph position="11"> The stability of the components was further checked by forming four subsets of speakers and applying the PCA to all 16 combinations of these subsets. A higher number of observations (speakers) results in higher stability. For VM-1 and TABA-1 with 800 observations, all four components appear in every subset. For VM-2 and TABA-2 with 400 observations, three common components exist, while the fourth (least important) component varies. For CH-1 and VM-3 with approximately 160 speakers, two common components are found in all subsets.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Clustering </SectionTitle> <Paragraph position="0"> If a limited set of linguistically interpretable components exists, as has been argued in the previous section, the question is whether speaker groups can be established, and whether unseen speakers can be reliably assigned to the correct group.</Paragraph> <Paragraph position="1"> To establish classes, the k-means algorithm was employed. This algorithm works by repeatedly moving all cluster centers to the mean of their Voronoi sets. The algorithm stops if no cluster center has changed during the last iteration or the maximum number of iterations is reached (Hartigan and Wong, 1979). The initial cluster centers are randomly assigned; thus, slightly different results are possible. [Figure 5: cluster centers in the plane of the first two components for five runs with varied input values.]</Paragraph> <Paragraph position="2"> 
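The k-means procedure just described (centers repeatedly moved to the mean of their Voronoi sets, random initialization, stop on convergence or after a maximum number of iterations) can be sketched minimally as:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means over lists of equal-length numeric tuples.
    Stops when no center moved during an iteration or max_iter is reached."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # assign each point to its nearest center (Voronoi sets)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # move each center to the mean of its Voronoi set
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy groups of speaker points:
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, 2, seed=1)
```

As the paper notes, a different random initialization can yield slightly different results; a fixed seed is used here only to make the sketch reproducible.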
The algorithm results in a predefined number of speaker clusters (Doux et al., 1997) that can be used to train automatic classifiers.</Paragraph> <Paragraph position="3"> If a specific interpretation of the clusters for a given task is desired, the clustering can be done by hand (i.e. by explicitly constructing borders between classes). In the data-driven approach taken here, however, such hand-crafted constraints were avoided (the same holds for the PCA, which could also be replaced by explicit rules).</Paragraph> <Paragraph position="4"> Figure 5 shows a distribution of cluster centers for five different runs on the same data but with different initial cluster centers. The distribution, displayed here in the plane of the first two components, is fairly stable.</Paragraph> <Paragraph position="5"> The four most important principal components (measured by their eigenvalues) computed for each speaker were used as input for the subsequent tests. This choice was motivated by the results of the component stability experiments described above. The current set of speakers does not support a more fine-grained distinction. Furthermore, it is unlikely that more dimensions will be useful for applications.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Classification </SectionTitle> <Paragraph position="0"> A correct classification of individual linguistic styles, described by the parameter set used in this experiment, means that a speaker is put into the same class by the cluster analysis and by the automatic classification. To test this hypothesis, the turns for each speaker were alternately divided into two sets of the same size. The first set was used for the training of the classifier (calculate the PCA, estimate clusters to obtain classes). The second set served as a test set. 
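The alternating per-speaker split can be sketched as below; that even-indexed turns go to training and odd-indexed turns to testing is an assumption, as the paper only states that the turns were alternately divided:

```python
def split_turns(turns):
    """Alternately divide a speaker's turns into two equal-sized sets:
    one for training (PCA + clustering) and one for testing."""
    train = turns[0::2]  # 1st, 3rd, 5th, ... turn
    test = turns[1::2]   # 2nd, 4th, 6th, ... turn
    return train, test

train_turns, test_turns = split_turns(["t1", "t2", "t3", "t4", "t5", "t6"])
```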
If the error rate (different classifications for the training set and the test set) is substantially lower than chance, one can state that the parameters can be used to reliably discriminate between speaker classes.</Paragraph> <Paragraph position="1"> Neural networks were used for automatic classification. For training, two sets of input vectors were generated from the original training set. Every n-th pattern became part of a development set (n is 5 for all experiments described here). The output of the nets consisted of k output values, one for each class, each either 0 or 1. Fully connected feed-forward nets with standard back-propagation were trained until the error averaged over the last three runs on the development set began to increase (overtraining). The net topology had one input layer, one output layer, and one hidden layer with the same number of nodes as the input layer.</Paragraph> <Paragraph position="2"> The speaker-specific values for the test set and the values for the principal components were computed and used as input for the neural network. The classification of the network was judged correct for a speaker if the cluster determined by the cluster analysis on the training set was equal to the class predicted by the neural network on the test set.</Paragraph> <Paragraph position="3"> The results are displayed in Figures 6 and 7. While the results for the TABA corpus are only slightly above chance level (25 %), the results for the VM and CH corpora indicate that for human-human corpora a speaker can be fairly reliably classified. If not enough turns or words are available, the result decreases. 
The result also decreases if the number of speakers is too small (below 30; results are not shown here).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> These results indicate that style classification in a spoken dialogue system is only feasible if the number of interactions (turns) is sufficiently high and linguistically rich. If this is the case, however, speakers can be classified according to simple part-of-speech distribution and turn-length parameters that can be computed automatically (Section 3.1), automatically combined into components interpretable as linguistic style indicators (Section 3.3), employed to automatically group speakers into classes (Section 3.4), and, in turn, used by automatic learning methods to classify unseen turns from a speaker into the same class as the reference turns from that speaker used during training (Section 3.5).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Applications </SectionTitle> <Paragraph position="0"> Several possible applications for linguistic style information exist in a spoken dialogue system. Among these are: style-specific language models; style-specific grammars; input to a general user model (certain elements of linguistic style can be related to paradigm or task knowledge, e.g. turn length or the number of content words); and influencing the style of a language generation module. 
An exploratory experiment was undertaken for the first application.</Paragraph> <Paragraph position="1"> [Figure 6: classification rate for the Verbmobil corpus, displayed for varying input sets (see Figure 7 for a description).]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Material and Method </SectionTitle> <Paragraph position="0"> The three corpora described above were used for the investigation. For the CH and VM corpora, all speakers with more than 100 words were used; this threshold was set to 40 for the TABA corpus. The clustering method described above, based on the results of a PCA, was employed to group the speakers into 2 to 4 clusters based on 2 to 4 factors. The k-means clustering algorithm with a fixed set of clusters is initialized by random cluster centers, so two subsequent runs give slightly different results. Therefore, each run was performed twice.</Paragraph> <Paragraph position="1"> The four corpora plus their combination were divided into 10 sets for a ten-fold cross-validation.</Paragraph> <Paragraph position="2"> [Figure 7: the task, the minimal number of words and turns per single speaker used as constraints, and the resulting number of speakers and classification rate.]</Paragraph> <Paragraph position="3"> One set served as development set for parameter optimization, another set was the test set, and the remaining eight sets were used to train the language models. In each run, five standard bigram language models were trained, one for each speaker-specific corpus and one for their combination. The perplexity was calculated for the pertinent test sets.</Paragraph> <Paragraph position="4"> In a subsequent step, for each speaker-specific corpus the general and the specific language models were linearly interpolated; the interpolation factor was iterated over the values 0, 0.2, 0.4, 0.6, 0.8, and 1.0. 
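The interpolation sweep can be sketched as follows, with lists of per-word probabilities standing in for the general and speaker-specific bigram models (the function names and probability values are illustrative, not from the paper):

```python
import math

def perplexity(probs):
    """Perplexity of a text from the per-word probabilities a model assigns."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

def best_lambda(general, specific, lambdas=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Pick the interpolation weight minimizing dev-set perplexity of
    p = lam * p_specific + (1 - lam) * p_general."""
    def ppl(lam):
        return perplexity([lam * s + (1 - lam) * g
                           for g, s in zip(general, specific)])
    return min(lambdas, key=ppl)

# Invented dev-set probabilities: the specific model fits this text better,
# so the sweep should select the largest interpolation weight.
lam = best_lambda([0.1] * 4, [0.3] * 4)
```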
The interpolation factor that gave the best results (smallest perplexity) on the development set was taken to compute the perplexity on the test set.</Paragraph> </Section> </Section> </Paper>