<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2015">
  <Title>Noriyoshi</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In Japan, the television reception environment has become quite diverse in recent years. In addition to analog broadcasts, BS (Broadcast Satellite) digital television and data broadcasts have been operating since 2000. At the same time, TV operations for receiving such broadcasts are becoming increasingly complex, and an ever-increasing variety of peripheral devices such as video tape recorders, disk recorders, DVD players, and game consoles are now being connected to televisions. Operating such devices through different kinds of interfaces is becoming troublesome not only for the elderly but for general users as well (Komine et al., 2000).</Paragraph>
    <Paragraph position="1"> Recently we conducted a usability test targeting data broadcasts in BS digital broadcasting. The results of the test revealed that many subjects had trouble accessing hierarchically arranged data.</Paragraph>
    <Paragraph position="2"> This finding revealed the need for an easy means of accessing desired programs. One such means is a spoken natural language dialogue (hereafter spoken dialogue) interface for TV operations. If spoken dialogue could be used to select and search for programs, to operate peripheral devices, and to give information in reply to system queries, such an interface would be extremely valuable in a multi-channel, multi-service viewing environment. With this in mind, we set out to build an interface system that can operate a television via spoken dialogue in place of manual operations.</Paragraph>
    <Paragraph position="3"> 2 Collecting dialogue data for TV operations
Assuming that a television is intelligent enough to understand the words spoken by a human, what kind of language expressions would a user use to give commands to that television? In other words, the words spoken by a user in such a situation must be carefully examined when designing a television interface that uses spoken dialogue. We therefore first built an experimental environment that would enable us to collect dialogue data based on the WOZ (Wizard of Oz) method.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Wizard of OZ
</SectionTitle>
      <Paragraph position="0"> We set up a television-operation environment according to the WOZ framework, in which the subjects were instructed that &amp;quot;the character appearing on the television screen can understand anything you say, and the character will operate the television for you.&amp;quot; The number of channels that could be selected was 19, and screens displaying the Electronic Program Guide (EPG) and a user interface for program searching were presented as needed (Komine et al., 2002).</Paragraph>
      <Paragraph position="1"> This WOZ environment required two operators, one in charge of voice responses and the other of user interface operations. The voice-response operator returns a voice response to the subject via a speech synthesizer, after selecting a reply from about 50 previously prepared statements or typing a reply directly on a keyboard. If the subject happens to be silent, the operator returns a response that introduces new services or prompts the subject to say something. The user interface operator first determines what the subject wants, and then manipulates the user interface or EPG and performs basic television operations such as changing channels.</Paragraph>
      <Paragraph position="2"> The subjects selected for data collection consisted of 10 men and 10 women ranging in age from 24 to 31 (average age: 28.7), and each was allowed to speak freely with the television for 5 minutes under the assumption that the &amp;quot;television has a certain amount of intelligence.&amp;quot;</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Results of data analysis
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows an example of dialogue data recorded during a WOZ session. On analyzing collected utterances made by the subjects (1,268 utterances in total), it was found that 83% of user utterances concerned requests made to the television, and that 89% of those requests included words belonging to specific categories such as program title, genre, performer, station, time, and TV operation commands. The remaining 17% of utterances did not concern the system but were rather a result of subjects talking or muttering to themselves for self-confirmation and the like.</Paragraph>
      <Paragraph position="1"> Here, we consider why most utterances belonged to specific categories despite the fact that a variety of requests could be made. In this system, TV program- and operation-related information is displayed on the television screen, and based on this information, subjects tended to underestimate the television's capabilities and to restrict their utterances to the service functions they saw as possible. It is also thought that the conventional image of television in the subjects' minds served to restrict their utterances.</Paragraph>
      <Paragraph position="2"> As part of this WOZ experiment, we also had the subjects fill out a questionnaire regarding television operation via a spoken dialogue interface. When asked to give an opinion on operating a television by voice, more than half replied &amp;quot;Yes, I would like to,&amp;quot; indicating a high demand for a spoken dialogue interface. On the other hand, most subjects who replied &amp;quot;No, I would not like to&amp;quot; cited embarrassment at speaking out loud as one reason, and reluctance to vocalize commands when watching television together with their families as another.</Paragraph>
      <Paragraph position="3"> In this regard, we think that embarrassment could probably be reduced through user experience and appropriate environment configuration.</Paragraph>
      <Paragraph position="4"> 3 Spoken dialogue interface system for TV operations
Based on the results of the data analysis, we built a prototype system that enables television operations via spoken dialogue. Figure 2 shows the configuration of this system. The system allows users to select real-time broadcast programs from 19 channels.</Paragraph>
      <Paragraph position="5"> It also enables the presentation of program information. [Figure 1 shows an excerpt of the recorded dialogue:]</Paragraph>
      <Paragraph position="6"> 01:08:27 Subject Well, I would like to see more at the bottom of the screen.</Paragraph>
      <Paragraph position="7"> 01:12:09 WOZ OK, I will do it.</Paragraph>
      <Paragraph position="8"> 01:15:23 Subject Um, just a little bit more.</Paragraph>
      <Paragraph position="9"> 01:17:27 WOZ OK, how's that?</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Robot interface
</SectionTitle>
      <Paragraph position="0"> The user makes operation requests to the interface robot (IFR), as shown in Figure 3, and the IFR operates the television accordingly. The IFR is equipped with a super-unidirectional microphone and a speaker, and it activates and communicates with the system's speech recognition, speech synthesis, and dialogue processing modules. The IFR has been given the appearance of a stuffed animal.</Paragraph>
      <Paragraph position="1"> One advantage of this IFR is that it can be directly touched and manipulated to create a feeling of warmth and closeness.</Paragraph>
      <Paragraph position="2"> On hearing a greeting or being called by its name, the IFR opens its eyes and enters a state in which it can perform various operations. For example, the IFR can help the user search for a program, can present information about any program on the television screen, and can return voice responses.</Paragraph>
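The wake-on-greeting behaviour described above can be sketched as a two-state machine: the robot ignores speech until it hears a greeting or its name, then forwards subsequent utterances to dialogue processing. This is an illustrative sketch only; the wake phrases and class names are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of the IFR's dormant/active behaviour.
WAKE_WORDS = {"hello", "good morning", "robot"}  # illustrative wake phrases


class InterfaceRobot:
    def __init__(self):
        self.awake = False  # eyes closed, ignoring commands

    def hear(self, utterance: str) -> str:
        words = set(utterance.lower().split())
        if not self.awake:
            if words & WAKE_WORDS:
                self.awake = True  # "opens its eyes"
                return "awake"
            return "ignored"
        # Active: hand the utterance to dialogue processing (stubbed here).
        return "handled"


ifr = InterfaceRobot()
assert ifr.hear("change the channel") == "ignored"  # still dormant
assert ifr.hear("hello robot") == "awake"
assert ifr.hear("change the channel") == "handled"
```

The single boolean state is of course a simplification; the real system presumably also times out back to the dormant state.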
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Speech recognition
</SectionTitle>
      <Paragraph position="0"> The speech recognition module uses an algorithm that finalizes recognition results sequentially, for real-time operation and a high speech recognition rate. When this module is applied to news programs, a speech recognition rate of about 95% is obtained (Imai, 2000).</Paragraph>
      <Paragraph position="1"> In speech that occurs during television operations, words such as program titles, names of broadcast stations, and names of entertainers have a high probability of occurring and are also updated frequently.</Paragraph>
      <Paragraph position="2"> For this reason, newly acquired word lists are automatically registered in a dictionary on a daily basis. In addition, as program titles often consist of multiple words, each title must be registered as a single word in order to improve the recognition rate.</Paragraph>
      <Paragraph position="3"> Despite such additional tuning, it is still difficult to achieve perfect results with current speech recognition technology. To give the user feedback at the time of erroneous recognition, recognition results are always displayed in the lower-left corner of the television screen.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Dialogue processing
</SectionTitle>
      <Paragraph position="0"> In dialogue processing, it is generally difficult to understand intent by performing only a lexical analysis of speech. If we limit tasks to dialogue used in television operation, however, the words spoken by a user have a high probability of falling into specific categories such as program title, as indicated by the data analysis described in Section 2.2. Consequently, user intent can be inferred from a combination of specific categories and predicates.</Paragraph>
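The category-plus-predicate idea can be illustrated with a small lookup: an utterance containing a program title together with a "watch" predicate suggests a program-selection request, a genre plus a "search" predicate suggests a search request, and so on. The rule table below is a hypothetical stand-in, not the authors' actual mapping.

```python
# Illustrative intent inference from (category, predicate) pairs.
INTENT_RULES = {
    ("title", "watch"): "select_program",
    ("genre", "search"): "search_program",
    ("channel", "change"): "change_channel",
}


def infer_intent(category: str, predicate: str) -> str:
    # Unknown combinations fall through to a clarification dialogue.
    return INTENT_RULES.get((category, predicate), "unknown")


assert infer_intent("title", "watch") == "select_program"
assert infer_intent("weather", "ask") == "unknown"
```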
      <Paragraph position="1"> From the viewpoint of processing speed, processing can be performed in real time if we use a pattern-based approach. This approach is also used in other dialogue systems, such as PC-based agent television systems in the FACTS project (Sumiyoshi et al., 2002).</Paragraph>
      <Paragraph position="2"> The dialogue processing module performs real-time morphological analysis of input statements from the speech recognition module. A statement is then identified by pattern matching in units of morphemes, and the meaning ascribed beforehand to that statement is obtained. An example of such a pattern is shown in Figure 4, using the metacharacters listed in Table 1. In the pattern-matching process, categories important to television operations are stored as slots. Table 2 lists these category slots and examples of their members. The words stored in these slots are then used as a basis for generating television operation commands and search expressions to access the TV program database. Response statements to input statements may take various forms depending on the patterns and current circumstances; they are generated by taking into account slot information, the response history, and the results of searching for program information.</Paragraph>
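The slot-filling and command-generation steps above can be sketched as follows. Since Figure 4 and Tables 1 and 2 are not reproduced here, the lexicon, slot names, and command strings are illustrative assumptions rather than the system's actual pattern language.

```python
# Hedged sketch of slot filling over analyzed morphemes and command
# generation from filled slots (illustrative only).
SLOTS = ("title", "genre", "performer", "station", "time", "command")


def fill_slots(utterance: str, category_of: dict) -> dict:
    """Store each recognized category word in its slot."""
    slots = dict.fromkeys(SLOTS)  # all slots start empty (None)
    for word in utterance.lower().split():  # stand-in for morphemes
        cat = category_of.get(word)
        if cat in SLOTS:
            slots[cat] = word
    return slots


def to_command(slots: dict) -> str:
    """Turn filled slots into a TV command or an EPG search expression."""
    if slots["station"]:
        return f"SELECT_CHANNEL {slots['station']}"
    if slots["title"]:
        return f"SEARCH_EPG title={slots['title']}"
    return "NOOP"


lexicon = {"nhk": "station", "sumo": "title"}  # hypothetical category lexicon
assert to_command(fill_slots("please show nhk", lexicon)) == "SELECT_CHANNEL nhk"
assert to_command(fill_slots("i want sumo", lexicon)) == "SEARCH_EPG title=sumo"
```

In the real system the tokens would come from the morphological analyzer and the matching would follow the Figure 4 patterns; the whitespace split here is only a placeholder.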
    </Section>
  </Section>
</Paper>