<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3404"> <Title>You Are What You Say: Using Meeting Participants' Speech to Detect their Roles and Expertise</Title> <Section position="4" start_page="24" end_page="24" type="metho"> <SectionTitle> 3 Functional Roles </SectionTitle> <Paragraph position="0"> Meeting participants have functional roles that ensure the smooth conduct of the meeting, without regard to the specific contents of the meeting.</Paragraph> <Paragraph position="1"> These roles may include that of the meeting leader, whose functions typically include starting the meeting, establishing the agenda (perhaps in consultation with the other participants), making sure the discussions remain on-agenda, moving the discussion from agenda item to agenda item, etc. Another possible functional role is that of the designated meeting scribe. Such a person may be tasked with taking the official notes or minutes for the meeting.</Paragraph> <Paragraph position="2"> Currently we are attempting to automatically detect the meeting leader for a given meeting. In our data (as described in section 2) the participant playing the role of the manager is always the meeting leader. In section 5 we describe our methodology for automatically detecting the meeting leader.</Paragraph> </Section> <Section position="5" start_page="24" end_page="25" type="metho"> <SectionTitle> 4 Expertise </SectionTitle> <Paragraph position="0"> Typically each participant in a meeting makes contributions to the discussions at the meeting (and to the project or organization in general) based on their own expertise or skill set. For example, a project to build a multi-modal note-taking application may include project members with expertise in speech recognition, in video analysis, etc. We define expertise-based roles as roles based on skills that are relevant to participants' contributions to the meeting discussions and to the project or organization in general. Note that the expertise role a participant plays in a meeting is potentially dependent on the expertise roles of the other participants in the meeting, and that a single person may play different expertise roles in different meetings, or even within a single meeting. For example, a single person may be the &quot;speech recognition expert&quot; on the note-taking application project that simply uses off-the-shelf speech recognition tools to perform note taking, but a &quot;noise cancellation&quot; expert on the project that is attempting to improve the in-house speech recognizer. Automatically detecting each participant's roles can help such meeting understanding components as the action item detector.</Paragraph> <Paragraph position="1"> Ideally we would like to automatically discover the roles that each participant plays, and cluster these roles into groups of similar roles so that the meeting understanding components can transfer what they learn about particular participants to other (and newer) participants with similar roles. Such a role detection mechanism would need no prior training data about the specific roles that participants play in a new organization or project. Currently, however, we have started with a simplified participant role detection task where we do have training data pertinent to the specific roles that meeting participants play in the test set of meetings.
As mentioned in section 2, our data consists of people playing two kinds of expertise-based roles - that of a hardware acquisition expert, and that of a building facilities expert. In the next section we discuss our methodology for automatically detecting these roles from the meeting participants' speech.</Paragraph> </Section> <Section position="6" start_page="25" end_page="27" type="metho"> <SectionTitle> 5 Methodology </SectionTitle> <Paragraph position="0"> Given a sequence of longitudinal meetings, we define our role detection task as a three-way classification problem, where the input to the classifier consists of features extracted from the speech of a particular participant over the given meetings, and the output is a probability distribution over the three possible roles. Note that although a single participant can simultaneously play both a functional and an expertise-based role, in the Y2 Scenario Data each participant plays exactly one of the three roles. We take advantage of this situation to simplify the problem to the three-way classification defined above. We induce a decision tree (Quinlan, 1986) classifier from hand-labeled data. In the next subsection we describe the steps involved in training the decision tree role classifier, and in the subsequent subsection we describe how the trained decision tree is used to arrive at a role label for each meeting participant.</Paragraph> <Section position="1" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 5.1 Training </SectionTitle> <Paragraph position="0"> One of the sources of information that we wish to employ to perform functional and expertise role detection is the words that are spoken by each participant over the course of the meetings. Our approach to harnessing this information source is to use labeled training data to first create a set of words most strongly associated with each of the three roles, and then use only these words during the feature extraction phase to detect each participant's role, as described in section 5.1.2.</Paragraph> <Paragraph position="1"> We created this list of keywords as follows. Given a training set of meeting sequences, we aggregated for each role all the speech from all the participants who had played that role in the training set. We then split this data into individual words and removed stop words - closed-class words (mainly articles and prepositions) that typically contain less information pertinent to the task than do nouns and verbs. For all words across all three roles, we computed the degree of association between each word and each of the three roles using the chi-squared method (Yang and Pedersen, 1997), and chose the 200 highest-scoring word-role pairs. Finally, we manually examined this list of words and removed additional words that we deemed not to be relevant to the task (essentially identifying a domain-specific stop list). This reduced the list to a total of 180 words. The 5 most frequently occurring words in this list are: computer, right, need, week and space. Intuitively, the goal of this keyword selection pre-processing step is to save the decision tree role classifier from having to automatically detect the important words from a much larger set of words, which would require more data to train.</Paragraph>
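A minimal sketch of this keyword-selection step (illustrative Python; the per-role token lists, the stop-word list, and the helper names are assumed for illustration, and the chi-squared score is the standard 2x2 contingency formulation from Yang and Pedersen rather than the authors' exact code):

    from collections import Counter

    def chi_squared(word, role, counts, totals):
        # Chi-squared association between one word and one role, from a 2x2 contingency table.
        a = counts[role][word]                                   # occurrences of the word in this role's speech
        b = sum(counts[r][word] for r in counts if r != role)    # occurrences of the word in the other roles' speech
        c = totals[role] - a                                     # other words spoken in this role
        d = sum(totals[r] for r in totals if r != role) - b      # other words spoken in the other roles
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0

    def select_keywords(role_to_tokens, stop_words, top_k=200):
        # role_to_tokens maps each role to all tokens spoken by its training participants.
        counts = {r: Counter(t for t in toks if t not in stop_words)
                  for r, toks in role_to_tokens.items()}
        totals = {r: sum(c.values()) for r, c in counts.items()}
        scored = [(chi_squared(w, r, counts, totals), w, r)
                  for r, c in counts.items() for w in c]
        scored.sort(reverse=True)
        return scored[:top_k]   # top-scoring (score, word, role) triples, to be pruned manually afterwards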
<Paragraph position="2"> The input to the decision tree role classifier is a set of features abstracted from a specific participant's speech. One strategy is to extract exactly one set of features from all the speech belonging to a participant across all the meetings in the meeting sequence. However, this approach would require a very large number of meetings to train. Our chosen strategy is to sample the speech output by each participant multiple times over the course of the meeting sequence, classify each such sample, and then aggregate the evidence over all the samples to arrive at the overall likelihood that a participant is playing a certain role. To perform the sampling, we split each meeting in the meeting sequence into a sequence of contiguous windows, each n seconds long, and then compute one set of features from each participant's speech during each window. The value of n is decided through parametric tests (described in section 7.1).</Paragraph> <Paragraph position="3"> If a particular participant was silent during the entire duration of a particular window, then features are extracted from that silence.</Paragraph> <Paragraph position="4"> Note that in the above formulation, there is no overlap (nor gap) between successive windows. In a separate set of experiments we used overlapping windows. That is, given a window size, we moved the window by a fixed step size (less than the size of the window) and computed features from each such overlapping window. The results of these experiments were no better than those with non-overlapping windows, and so for the rest of this paper we report only the results with the non-overlapping windows.</Paragraph> <Paragraph position="5"> Given a particular window of speech of a particular participant, we extract the following two speech-length-based features: * Rank of this participant (among this meeting's participants) in terms of the length of his speech during this window. Thus, if this participant spoke the longest during the window, he has a feature value of 1; if he spoke for the second longest time, he has a feature value of 2, etc.</Paragraph> <Paragraph position="6"> * Ratio of the length of speech of this participant in this window to the total length of speech from all participants in this window. Thus if a participant spoke for 3 seconds, and the total length of speech from all participants in this window was 6 seconds, his feature value is 0.5. Together with the rank feature above, these two features capture the amount of speech contributed by each participant to the window, relative to the other participants.</Paragraph> <Paragraph position="7"> In addition, for each window of speech of a particular participant, and for each keyword in our list of pre-decided keywords, we extract the following two features: * Rank of this participant (among this meeting's participants) in terms of the number of times this keyword was spoken. Thus if in this window of time, this participant spoke the keyword printer more often than any of the other participants, then his feature value for this keyword is 1.</Paragraph> <Paragraph position="8"> * Ratio of the number of times this participant uttered this keyword in this window to the total number of times this keyword was uttered by all the participants during this window. Thus if a participant spoke the word printer 5 times in this window, and in total all participants said the word printer 7 times, then his feature value for this keyword is 5/7. Together with the keyword rank feature above, these two features capture the number of times each participant utters each keyword, relative to the other participants (see the sketch following this list).
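A minimal sketch of these per-window rank and ratio features (illustrative Python; the window representation, mapping each participant to a speech length in seconds and a keyword-count dictionary, is assumed for this example rather than the system's actual data structures; ties in the ranks are broken arbitrarily):

    def window_features(window, participant, keywords):
        # window: dict participant -> {"speech_len": seconds, "kw_counts": {keyword: count}}
        # Returns a flat feature dict: 2 speech-length features plus 2 features per keyword.
        feats = {}

        # Speech-length rank (1 = spoke the longest in this window) and ratio of total speech.
        lengths = {p: w["speech_len"] for p, w in window.items()}
        ranked = sorted(lengths, key=lengths.get, reverse=True)
        total_len = sum(lengths.values())
        feats["len_rank"] = ranked.index(participant) + 1
        feats["len_ratio"] = lengths[participant] / total_len if total_len else 0.0

        # For each keyword: rank by utterance count and ratio of all utterances in the window.
        for kw in keywords:
            counts = {p: w["kw_counts"].get(kw, 0) for p, w in window.items()}
            ranked_kw = sorted(counts, key=counts.get, reverse=True)
            total_kw = sum(counts.values())
            feats[f"{kw}_rank"] = ranked_kw.index(participant) + 1
            feats[f"{kw}_ratio"] = counts[participant] / total_kw if total_kw else 0.0

        return feats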
Thus, for each participant, for each meeting window, we extract two features based on the lengths of speech and two features for each of the 180 keywords (2 x 180 = 360), for a total of 362 features. The true output label for each such data point is the role of that participant in the meeting sequence. We used these data points to induce a classifier using the Weka Java implementation (Witten and Frank, 2000) of the C4.5 decision tree learning algorithm (Quinlan, 1986).</Paragraph> <Paragraph position="9"> This classifier takes features as described above as input, and outputs class membership probabilities, where the classes are the three roles. Note that for the experiments in this paper we extract these features from the manual transcriptions of the speech of the meeting participants. In the future we plan to perform these experiments using the transcriptions output by an automatic speech recognizer.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 5.2 Detecting Roles in Unseen Data 5.2.1 Classifying Windows of Unseen Data </SectionTitle> <Paragraph position="0"> Detecting the roles of meeting participants in unseen data is performed as follows: First the unseen test data is split into windows of the same size as was used during the training regime. Then the speech-activity and keyword-based features are extracted (using the same keywords as were used during training) for each participant in each window. Finally, these data points are used as input to the trained decision tree, which outputs class membership probabilities for each participant in each window.</Paragraph> </Section> <Section position="3" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 5.2.2 Aggregating Evidence to Assign One Role Per Participant </SectionTitle> <Paragraph position="0"> Thus for each participant we get as many probability distributions (over the three roles) as there are windows in the test data. The next step is to aggregate these probabilities over all the windows and arrive at a single role assignment per participant. We employ the simplest possible aggregation method: we compute, for each participant, the average probability of each role over all the windows, and then normalize the three average role probabilities so calculated, so that they still sum to 1. In the future we plan to experiment with more sophisticated aggregation mechanisms that jointly optimize the probabilities of the different participants, instead of computing them independently.</Paragraph> <Paragraph position="1"> At this point, we could assign to each participant his highest-probability role. However, we wish to ensure that the set of roles assigned to the participants in a particular meeting is as diverse as possible (since typically meetings are forums at which people of different expertise convene to exchange information). To ensure such diversity, we apply the following heuristic. Once we have all the average probabilities for all the roles for each participant in a sequence of meetings, we assign roles to participants in stages. At each stage we consider all participants not yet assigned roles, and pick the participant-role pair, say (p, r), that has the highest probability value among all pairs under consideration. We assign participant p the role r, and then discount (by a constant multiplicative factor) the probability value of all participant-role pairs (p_i, r_j) where p_i is a participant not yet assigned a role and r_j = r. This makes it less likely (but not impossible) that another participant will be assigned the same role r again. This process is repeated until every participant has been assigned a role.</Paragraph>
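A sketch of this aggregation and diversity heuristic (illustrative Python; the per-window distributions would come from the trained decision tree, and the discount factor of 0.5 is an assumed value, since the paper specifies only a constant multiplicative factor):

    def assign_roles(window_probs, roles, discount=0.5):
        # window_probs: dict participant -> list of per-window {role: probability} dicts.
        # Average each participant's window distributions and re-normalize them to sum to 1.
        avg = {}
        for p, dists in window_probs.items():
            mean = {r: sum(d[r] for d in dists) / len(dists) for r in roles}
            z = sum(mean.values())
            avg[p] = {r: v / z for r, v in mean.items()}

        # Greedy assignment in stages: take the best remaining (participant, role) pair,
        # then discount that role for everyone not yet assigned.
        assigned = {}
        while len(assigned) != len(avg):
            p, r = max(((p, r) for p in avg if p not in assigned for r in roles),
                       key=lambda pr: avg[pr[0]][pr[1]])
            assigned[p] = r
            for q in avg:
                if q not in assigned:
                    avg[q][r] *= discount   # less likely, but not impossible, to reuse role r
        return assigned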
</Section> </Section> <Section position="7" start_page="27" end_page="27" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> We evaluated the algorithm by computing the accuracy of the detector's role predictions. Specifically, given a meeting sequence we ran the algorithm to assign a role to each meeting participant, and computed the accuracy as the ratio of the number of correct assignments to the total number of participants in the sequence. Note that it is also possible to evaluate the window-by-window classification of the decision tree classifier; we report results on this evaluation in section 7.1.</Paragraph> <Paragraph position="1"> To evaluate this participant role detection algorithm, we first trained the algorithm on the training set of meetings. The training phase included keyword list creation, window size optimization, and the actual induction of the decision tree. A window size of 300 seconds resulted in the highest accuracy over the training set. The test at the root of the induced tree was whether the participant's rank in terms of speech lengths was 1, in which case he was immediately classified as a meeting leader. That is, the tree learnt that the person who spoke the most in a window was most likely the meeting leader. Other tests placed high in the tree included obvious ones such as testing for the keywords computer and printer to classify a participant as a hardware expert.</Paragraph> <Paragraph position="2"> We then tested this trained role detector on the testing set of meetings. Recall that the test set had 5 meeting sequences, each consisting of 5 meetings and a total of 20 meeting participants. Over this test set we obtained a role detection accuracy of 83%.</Paragraph> <Paragraph position="3"> A &quot;classifier&quot; that randomly assigns one of the three roles to each participant in a meeting (without regard to the roles assigned to the other participants in the same meeting) would achieve a classification accuracy of 33.3%. Thus, our algorithm significantly beats the random classifier baseline. Note that, as mentioned earlier, the experiments in this paper are based on the manually transcribed speech.</Paragraph> </Section> <Section position="8" start_page="27" end_page="28" type="metho"> <SectionTitle> 7 Further Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 7.1 Optimizing the Window Size </SectionTitle> <Paragraph position="0"> As mentioned above, one of the variables to be tuned during the training phase is the size of the window over which to extract speech features. We ran a sequence of experiments to optimize this window size, the results of which are summarized in figure 1. In this set of experiments, we performed the evaluation at two levels of granularity. The larger granularity level was the &quot;meeting sequence&quot; granularity, where we ran the usual evaluation described above. That is, for each participant we first used the classifier to obtain probability distributions over the three roles on every window, and then aggregated these distributions to reach a single role assignment for the participant over the entire meeting sequence.
This role was compared to the true role of the participant to measure the accuracy of the algorithm. The smaller granularity level was the &quot;window&quot; level, where, after obtaining the probability distribution over the three roles for a particular window of a particular participant, we picked the role with the highest probability and assigned it to the participant for that window. Therefore, for each window we had a role assignment that we compared to the true role of the participant, resulting in an accuracy value for the classifier for every window for every participant.</Paragraph> <Paragraph position="1"> Note that the main difference between evaluation at these two granularity levels is that at the &quot;window&quot; granularity, we did not have any aggregation of evidence across multiple windows.</Paragraph> <Paragraph position="2"> For different window sizes, we plotted the accuracy values obtained on the test set for the two evaluation granularities, as shown in figure 1. Notice that by aggregating the evidence across the windows, the detection accuracy improves for all window sizes.</Paragraph> <Paragraph position="3"> This is to be expected since at the window granularity, the classifier has access only to the information contained in a single window, and is therefore more error-prone. However, by merging the evidence from many windows, the accuracy improves.</Paragraph> <Paragraph position="4"> As window sizes increase, detection accuracy at the window level improves, because the classifier has more evidence at its disposal to make the decision.</Paragraph> <Paragraph position="5"> However, detection at the meeting sequence level gets steadily worse, potentially because the larger the window size, the fewer the data points there are to aggregate evidence from. These lines will eventually meet when the window size equals the size of the entire meeting sequence.</Paragraph> <Paragraph position="6"> A valid concern with these results is the high level of noise, particularly in the aggregated detection accuracy over the meeting sequence. One reason for this is that there are far fewer data points at the meeting sequence level than at the window level. With larger data sets (more meeting sequences as well as more participants per meeting) these results may stabilize. Additionally, given the small amount of data, our feature set is quite large, so a more aggressive feature set reduction might help stabilize the results.</Paragraph> </Section> <Section position="2" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 7.2 Automatic Improvement over Unseen Data </SectionTitle> <Paragraph position="0"> One of our goals is to create an expertise-based role detector that improves over time as it has access to more and more meetings for a given participant. This is especially important because the roles that a participant plays can change over time; we would like our system to be able to track these changes. In the Y2 Scenario Data that we have used in the current work, the roles do not change from meeting to meeting. However, observe that our evidence aggregation algorithm fuses information from all the meetings in a specific sequence of meetings to arrive at a single role assignment for each participant.</Paragraph> <Paragraph position="1"> To quantify the effect of this aggregation, we computed the role detection accuracy using different numbers of meetings from each sequence. Specifically, we computed the accuracy of the role detection over the test data using only the last meeting of each sequence, only the last 2 meetings of each sequence, and so on until we used every meeting in every sequence.
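The effect of aggregating over different numbers of meetings can be sketched as follows (illustrative Python, reusing the assign_roles sketch above; the data layout, in which each meeting maps participants to their per-window role distributions, is assumed for this example):

    def accuracy_last_k(sequences, true_roles, roles, k):
        # Role-detection accuracy when only the last k meetings of each sequence are used.
        # sequences: list of meeting sequences; each sequence is a list of meetings, and each
        # meeting maps participant -> list of per-window {role: probability} dicts.
        correct = total = 0
        for seq_id, meetings in enumerate(sequences):
            # Pool the windows from the last k meetings of this sequence, per participant.
            pooled = {}
            for meeting in meetings[-k:]:
                for p, dists in meeting.items():
                    pooled.setdefault(p, []).extend(dists)
            for p, role in assign_roles(pooled, roles).items():
                correct += int(role == true_roles[seq_id][p])
                total += 1
        return correct / total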
The results are summarized in figure 2. When using only the last meeting in the sequence to assign roles to the participants, the accuracy is only 66.7%; when using the last two meetings, the accuracy is 75%; and using the last three, four, or all five meetings results in an accuracy of 83%. Thus, the accuracy improves as we have more meetings to combine evidence from, as expected. However, the accuracy levels off at 83% when using three or more meetings, perhaps because there is no new information to be gained by adding a fourth or a fifth meeting.</Paragraph> </Section> </Section> </Paper>