AUTOMATED QUALITY MONITORING FOR CALL CENTERS USING SPEECH AND NLP TECHNOLOGIES

1. INTRODUCTION

Every day, tens of millions of help-desk calls are recorded at call centers around the world. As part of a typical call center operation, a random sample of these calls is re-played to human monitors who score the calls with respect to a variety of quality-related questions, e.g.

* Was the account successfully identified by the agent?
* Did the agent request error codes/messages to help determine the problem?
* Was the problem resolved?
* Did the agent maintain appropriate tone, pitch, volume and pace?

This process suffers from a number of important problems. First, the monitoring at least doubles the cost of each call (first an operator is paid to take it, then a monitor to evaluate it). This leads to the second problem: only a very small sample of calls, e.g. a fraction of a percent, is typically evaluated. The third problem arises from the fact that most calls are ordinary and uninteresting; with random sampling, the human monitors spend most of their time listening to uninteresting calls.

This work describes an automated quality-monitoring system that addresses these problems. Automatic speech recognition is used to transcribe 100% of the calls coming in to a call center, and default quality scores are assigned based on features such as keywords, key-phrases, the number and type of hesitations, and the average silence durations. The default score is used to rank the calls from worst to best, and this sorted list is made available to the human evaluators, who can thus spend their time listening only to calls for which there is some a priori reason to expect that there is something interesting.
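To make the default-scoring step concrete, the following is a minimal sketch (not the authors' implementation) of how simple ASR-derived features such as key-phrase hits, hesitation counts, and average silence duration might be combined into a score used to rank calls from worst to best; the `Word` record, the phrase lists, and the feature weights are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical representation of one recognized word with its timing.
@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

HESITATIONS = {"UH", "UM", "ER", "MM"}
COURTESY_PHRASES = ["THANK YOU FOR CALLING", "ANYTHING ELSE"]

def default_score(words: list[Word]) -> float:
    """Assign a rough quality score to one transcribed call (higher = better).

    The features mirror the ones named in the text (key-phrases, hesitations,
    silence durations); the weights are arbitrary placeholders.
    """
    text = " ".join(w.text for w in words)
    n_hesitations = sum(w.text in HESITATIONS for w in words)
    gaps = [b.start - a.end for a, b in zip(words, words[1:]) if b.start > a.end]
    avg_silence = sum(gaps) / len(gaps) if gaps else 0.0
    n_phrases = sum(p in text for p in COURTESY_PHRASES)
    return 1.0 * n_phrases - 0.1 * n_hesitations - 0.5 * avg_silence

def rank_worst_to_best(calls: dict[str, list[Word]]) -> list[str]:
    """Return call ids sorted so that the likely-bad calls come first."""
    return sorted(calls, key=lambda call_id: default_score(calls[call_id]))
```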
The automatic quality-monitoring problem is interesting in part because of the variability in how hard it is to answer the questions. Some questions, for example "Did the agent use courteous words and phrases?", are relatively straightforward to answer by looking for key words and phrases. Others, however, require essentially human-level knowledge to answer; for example, one company's monitors are asked to answer the question "Did the agent take ownership of the problem?" Our work focuses on calls from IBM's North American call centers, where a set of 31 questions is used to evaluate call quality. Because of the high degree of variability found in these calls, we have investigated two approaches:

1. Use a partial score based only on the subset of questions that can be reliably answered.
2. Use a maximum entropy classifier to map directly from ASR-generated features to the probability that a call is bad (defined as belonging to the bottom 20% of calls).

We have found that both approaches are workable, and we present final results based on an interpolation between the two scores. These results indicate that for a fixed amount of listening effort, the number of bad calls that are identified approximately triples with our call-ranking approach. Surprisingly, while there has been significant previous scholarly research in automated call-routing and classification in the call center, e.g. [1, 2, 3, 4, 5], there has been much less on automated quality monitoring per se.

2. ASR FOR CALL CENTER TRANSCRIPTION

2.1. Data

The speech recognition systems were trained on approximately 300 hours of 6kHz, mono audio data collected at one of the IBM call centers located in Raleigh, NC. The audio was manually transcribed, and speaker turns were explicitly marked in the word transcriptions, but not the corresponding times. In order to detect speaker changes in the training data, we did a forced alignment of the data and chopped it at speaker boundaries. The test set consists of 50 calls with 113 speakers, totaling about 3 hours of speech.

2.2. Speaker Independent System

The raw acoustic features used for segmentation and recognition are perceptual linear prediction (PLP) features. The acoustic model consists of 50K Gaussians trained with MPE and uses a quinphone cross-word acoustic context. The techniques are the same as those described in [6].

2.3. Incremental Speaker Adaptation

In the context of speaker-adaptive training, we use two forms of feature-space normalization, vocal tract length normalization (VTLN) and feature-space MLLR (fMLLR, also known as constrained MLLR), to produce canonical acoustic models in which some of the non-linguistic sources of speech variability have been reduced. To this canonical feature space we then apply a discriminatively trained transform called fMPE [7]. The speaker-adapted recognition model is trained in this resulting feature space using MPE.

We distinguish between two forms of adaptation: off-line and incremental adaptation. For the former, the transformations are computed per conversation side using the full output of a speaker-independent system. For the latter, the transformations are updated incrementally using the decoded output of the speaker-adapted system up to the current time; the speaker-adaptive transforms are then applied to the future sentences. The advantage of incremental adaptation is that it requires only a single decoding pass (as opposed to two passes for off-line adaptation), resulting in a decoding process that is twice as fast. In Table 1, we compare the performance of the two approaches. Most of the gain of full off-line adaptation is retained in the incremental version.

We use an HMM-based segmentation procedure for segmenting the audio into speech and non-speech prior to decoding. The reason is that we want to eliminate the non-speech segments in order to reduce the computational load during recognition. The speech segments are clustered together in order to identify segments coming from the same speaker, which is crucial for speaker adaptation. The clustering is done via k-means, each segment being modeled by a single diagonal-covariance Gaussian; the metric is the symmetric K-L divergence between two Gaussians. We also measured the impact of the automatic segmentation and clustering on the error rate.
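The segment-clustering step above can be illustrated with a small sketch, assuming each speech segment has already been summarized by the per-dimension mean and variance of its acoustic features. The symmetric K-L divergence between diagonal-covariance Gaussians is standard; the segment representation and the simple k-means loop are illustrative assumptions, not the exact procedure used in the system.

```python
import numpy as np

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric K-L divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1.0)
    return kl12 + kl21

def cluster_segments(means, variances, k, n_iter=10, seed=0):
    """k-means-style clustering of segments under the symmetric K-L metric.

    means, variances: arrays of shape (n_segments, dim), one diagonal
    Gaussian per speech segment. Returns an array of cluster labels.
    """
    rng = np.random.default_rng(seed)
    init = rng.choice(len(means), size=k, replace=False)
    cen_mu, cen_var = means[init].copy(), variances[init].copy()
    labels = np.zeros(len(means), dtype=int)
    for _ in range(n_iter):
        # Assign each segment to the closest centroid Gaussian.
        for i, (mu, var) in enumerate(zip(means, variances)):
            dists = [sym_kl_diag(mu, var, cm, cv) for cm, cv in zip(cen_mu, cen_var)]
            labels[i] = int(np.argmin(dists))
        # Re-estimate each centroid from the segments assigned to it.
        for c in range(k):
            members = labels == c
            if members.any():
                cen_mu[c] = means[members].mean(axis=0)
                cen_var[c] = variances[members].mean(axis=0)
    return labels
```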
3. CALL RANKING

3.1. Question Answering

This section presents automated techniques for evaluating call quality. These techniques were developed using a training/development set of 676 calls with associated manually generated quality evaluations. The test set consists of 195 calls.

The quality of the service provided by the help-desk representatives is commonly assessed by having human monitors listen to a random sample of the calls and then fill in evaluation forms. The form for IBM's North American Help Desk contains 31 questions. A subset of the questions can be answered easily using automatic methods, among them the ones that check that the agent followed the guidelines, e.g.

* Did the agent follow the appropriate closing script?
* Did the agent identify herself to the customer?

But some of the questions require human-level knowledge of the world to answer, e.g.

* Did the agent ask pertinent questions to gain clarity of the problem?
* Were all available resources used to solve the problem?

We were able to answer 21 out of the 31 questions using pattern matching techniques. For example, if the question is "Did the agent follow the appropriate closing script?", we search for "THANK YOU FOR CALLING", "ANYTHING ELSE" and "SERVICE REQUEST". Any of these is a good partial match for the full script, "Thank you for calling, is there anything else I can help you with before closing this service request?"
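As an illustration of this kind of check, here is a minimal sketch, not the actual implementation used in this work; the question identifier, the phrase table, and the helper name are assumptions.

```python
import re

# Hypothetical mapping from a question to key phrases that count as evidence
# for a "yes" answer; the closing-script entry mirrors the example in the text.
QUESTION_PATTERNS = {
    "closing_script": [
        "THANK YOU FOR CALLING",
        "ANYTHING ELSE",
        "SERVICE REQUEST",
    ],
}

def answer_question(question_id: str, transcript: str) -> bool:
    """Answer one guideline question by searching the ASR transcript.

    Any one of the phrases is treated as a good partial match for the full
    script, as described in the text.
    """
    text = re.sub(r"\s+", " ", transcript.upper())
    return any(phrase in text for phrase in QUESTION_PATTERNS[question_id])
```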
Based on the answer to each of the 21 questions, we compute a score for each call and use it to rank them. We label a call in the test set as being bad/good if it has been placed in the bottom/top 20% by human evaluators. We report the accuracy of our scoring system on the test set by computing the number of bad calls that occur in the bottom 20% of our sorted list and the number of good calls found in the top 20% of our list. The accuracy numbers can be found in Table 2.

3.2. Maximum Entropy Ranking

Another alternative for scoring calls is to find arbitrary features in the speech recognition output that correlate with a call being in the bottom 20% or not. The goal is to estimate the probability of a call being bad based on features extracted from the automatic transcription. To achieve this, we build a maximum-entropy-based system which is trained on a set of calls with associated transcriptions and manual evaluations. The following equation is used to determine the score of a call C using a set of N predefined features:

P(\text{class} \mid C) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{N} \lambda_i f_i(C, \text{class}) \Big)    (1)

where class ∈ {bad, not-bad}, Z is a normalizing factor, the f_i() are indicator functions, and {λ_i}, i = 1..N, are the parameters of the model, estimated via iterative scaling [8].

Because our training set contained under 700 calls, we used a hand-guided method for defining features. Specifically, we generated a list of VIP phrases as candidate features, e.g. "THANK YOU FOR CALLING" and "HELP YOU". We also created a pool of generic ASR features, e.g. "number of hesitations", "total silence duration", and "longest silence duration". A decision tree was then used to select the most relevant features and the threshold associated with each feature. The final set of features contained 5 generic features and 25 VIP phrases. Looking at the weights learned for the different features, we can see that if a call has many hesitations and long silences, then it is most likely a bad call.

We use P(bad|C) as shown in Equation 1 to rank all the calls. Table 3 shows the accuracy of this system for the bottom and top 20% of the test calls.

At this point we have two scoring mechanisms for each call: one that relies on answering a fixed number of evaluation questions, and a more global one that looks across the entire call for hints. These two scores are both between 0 and 1, and can therefore be interpolated to generate one unique score. After optimizing the interpolation weights on a held-out set, we obtained a slightly higher weight (0.6) for the maximum entropy model. It can be seen in Table 4 that the accuracy of the combined system is greater than the accuracy of each individual system, suggesting the complementarity of the two initial systems.
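To make the maximum entropy scoring and the score interpolation concrete, here is a small sketch under stated assumptions: the binary indicator features, the learned weights, and the 0.6/0.4 interpolation follow the description above, but the data structures, the direction convention for the question-based score, and the function names are illustrative rather than the authors' implementation.

```python
import math

def maxent_p_bad(features: dict[str, int], weights: dict[tuple[str, str], float]) -> float:
    """P(bad | C) from Equation 1 for binary indicator features.

    `features` maps feature names to 0/1 indicator values for one call;
    `weights` maps (feature, class) pairs to the learned lambda values.
    """
    scores = {}
    for cls in ("bad", "not-bad"):
        s = sum(weights.get((name, cls), 0.0) * value for name, value in features.items())
        scores[cls] = math.exp(s)
    z = sum(scores.values())  # normalizing factor Z
    return scores["bad"] / z

def combined_score(p_bad: float, question_score: float, w_maxent: float = 0.6) -> float:
    """Interpolate the two per-call scores (both assumed to lie in [0, 1]).

    Higher values of the combined score indicate a worse call, so the
    question-based score (assumed higher = better) is flipped here; the
    weight of 0.6 on the maximum entropy model is the value reported in
    the text after tuning on held-out data.
    """
    return w_maxent * p_bad + (1.0 - w_maxent) * (1.0 - question_score)
```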