<?xml version="1.0" standalone="yes"?> <Paper uid="E85-1035"> <Title>AUTOMATED SPEECH RECOGNITION: A FRAMEWORK FOR RESEARCH</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AUTOMATED SPEECH RECOGNITION: A FRAMEWORK FOR RESEARCH </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="241" type="metho"> <SectionTitle> ABSTRACT </SectionTitle>
<Paragraph position="0"> This paper reflects the view that the decoding of speech, either by computer systems or by people, must to a large extent be determined by the ways in which the speaker has encoded the information necessary for its comprehension. We therefore place great emphasis on the use of psycholinguistics as a tool for the construction of models essential to the characterisation of the speech understanding task.</Paragraph>
<Paragraph position="1"> We are primarily concerned with the interactions between the various levels at which a fragment of speech can be described (e.g. acoustic-phonetic, lexical, syntactic, etc.), and the ways in which the knowledge bases associated with each of these &quot;levels&quot; contribute towards a final interpretation of an utterance. We propose to use the Chart Parser as a general computational framework for simulating such interactions, since its flexibility allows various models to be implemented and evaluated.</Paragraph>
<Paragraph position="2"> Within this general framework we discuss problems of information flow and search strategy in combining evidence across levels of description and across time during the extension of a hypothesis. We stress the importance of both psychological and computational theory in developing a particular control strategy which could be implemented within the framework.</Paragraph>
<Paragraph position="3"> Introduction
The decoding of speech, either by computer systems or by people, must to a large extent be determined by the ways in which the speaker has encoded the information necessary for its comprehension. Such a view is supported by a large body of experimental evidence concerning the ways in which various factors (e.g. predictability from context) affect both the acoustic clarity with which a speaker pronounces an utterance and the strategy the hearer appears to use in identifying it. The task of the computer system is to mimic, though preferably to model, this strategy. In order to do so, one should presumably draw on both computational and psychological theories of process. Such a dual approach has been shown to be feasible, and indeed desirable, by research into early visual processing (e.g. Marr, 1976), which has shown that there can come a point when psychological and computational descriptions become barely distinguishable. This analogy with early visual processing is significant because central to the development of the vision research was the notion of 'modelling': one can argue that a significant difference between the so-called '4th Generation' and '5th Generation' technologies is that with the former, ad hoc algorithms are applied to often incomplete and unreliable data, while with 5th Generation systems, algorithms are devised by first constructing qualitative models suited to the task domain.</Paragraph>
<Paragraph position="4"> We propose to use psycholinguistics as a tool for the construction of models essential to the characterisation of the speech understanding task.
We believe that this approach is essential to the development of automated speech recognition systems, and that it will also prove beneficial to psychological models of human speech processing, the majority of which are underdetermined from a computational point of view. Rumelhart and McClelland have recently adopted a similar approach to account for the major findings in the psychological literature on letter perception. By constructing a detailed computational model of the processes involved, they were able to give an alternative description of the recognition of certain letter strings, which was supported by subsequent psycholinguistic experiments. Rumelhart and McClelland emphasise the point that their results were not predictable 'on paper', but were the outcome of considerable experimentation with the computational model.</Paragraph>
<Section position="1" start_page="239" end_page="241" type="sub_section"> <SectionTitle> Requirements of the Computational Framework </SectionTitle>
<Paragraph position="0"> The experience of the ARPA speech project, which resulted in the design of a number of speech recognition systems, has demonstrated that the task of controlling the interactions between the knowledge bases which make up a system is at least as problematic as that of defining the knowledge bases themselves. Major inadequacies in the systems developed during the ARPA project can be attributed to early commitments in one or more areas of design whose consequences did not become apparent until final testing and evaluation of the complete system began. An architecture is required, therefore, which permits the parallel and relatively independent development of component knowledge bases and of methods for deploying them computationally. It should also permit the evaluation and testing of solutions with partially specified or simulated components. This will ensure that the design of one component does not unduly influence the design of any other component, possibly to the detriment of both. In addition, we should have the ability to determine the consequences of component design decisions by testing their contributions to the overall goals of speech recognition.</Paragraph>
<Paragraph position="1"> In order to fulfil these requirements we propose to use the active chart parser (e.g. Thompson & Ritchie, 1984). This was specifically designed as a flexible framework for the implementation (both serial and parallel) of different rule systems, and for the evaluation of strategies for using those rule systems. It is described below in more detail.</Paragraph>
<Paragraph position="2"> The Computational Model
The problem in designing optimal control or search strategies lies in combining evidence across different levels of description (e.g. acoustic-phonetic, morpho-phonemic, syntactic, etc.), and across time during the extension of a hypothesis, such that promising interpretations are given priority and the right one wins. In this section we shall consider just a few of the issues concerning this flow of information.</Paragraph>
<Paragraph position="3"> Automated speech systems, in particular those implemented during the ARPA-SUR project, have been forced to confront the errorful and ambiguous nature of speech, and to devise methods of controlling the very large search space of partial interpretations generated during processing.
Although the problem was exacerbated by the poor performance of the acoustic-phonetic processing used in these systems, the experimental evidence suggests that the solution will not be found simply by improving techniques for low-level feature detection. The situation appears to be analogous to that of visual processing, where &quot;significant&quot; features may be absent and, when present, may be open to a number of interpretations.</Paragraph>
<Paragraph position="4"> Combining evidence across different levels of description requires the specification of information flow between those levels. Within the psychological literature, there is a growing shift away from &quot;strong&quot; (or &quot;instructive&quot;) interactions towards &quot;weak&quot; (or &quot;selective&quot;) interactions. With the latter, the only permissible flow of information involves the filtering out, by one component, of alternatives produced by other components (cf. Marslen-Wilson & Tyler, 1980; Crain & Steedman, 1982; Altmann & Steedman, forthcoming), so in hierarchical terms no component determines what is produced by any component beneath it. A strong interaction, on the other hand, allows one component actively to direct, or guide, a second component in the pursuit of a particular hypothesis. Within the computational literature, weak interactions are also argued for on &quot;aesthetic&quot; grounds such as Marr's principles of modularity and least commitment (Marr, 1982).</Paragraph>
<Paragraph position="5"> The strongly interactive heterarchical and blackboard models implemented in HWIM and Hearsay II respectively have been criticised for the extremely complex control strategies which they required. Problems arise with the heterarchical model &quot;because of the difficulties of generating each of the separate interfaces and, more importantly, because of the necessity of specifying the explicit control scheme&quot; (Reddy & Erman, 1975). Similar problems arise with existing blackboard models: their information flow allows strong top-down direction of components, resulting once again in highly complex control strategies. Hierarchical models have a different problem, in that they allow too little interaction between the knowledge sources: within a strictly hierarchical system one cannot &quot;interleave&quot; the processes associated with each level of knowledge, and hence one cannot allow the very early filtering out, by higher-level components, of what might only be partial analyses at lower levels. This situation (disadvantageous for reasons of speed and efficiency) arises because of the lack of any common workspace over which the separate components can operate. There is, however, much to be said for hierarchical systems in terms of the relative simplicity of the control strategies needed to manage them, a consideration which is fundamental to the design of any speech recognition system.</Paragraph>
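The weak/strong distinction can be made concrete with a toy sketch. Python is used here and in the sketches below; the components, lexicon, and category labels are invented for illustration and belong to no actual system discussed in this paper.

```python
# A toy contrast between weak (selective) and strong (instructive)
# interaction between a syntactic and a lexical component.
# All names and data are illustrative assumptions.

def weak_interaction(lexical_hypotheses, syntax_permits):
    """Selective: syntax only filters what the lexical component has
    already proposed; it never tells that component what to produce."""
    return [w for w in lexical_hypotheses if syntax_permits(w)]

def strong_interaction(predicted_category, lexicon):
    """Instructive: syntax actively directs the lexical component to
    pursue a particular hypothesis (here, a word category)."""
    return [w for w, cat in lexicon.items() if cat == predicted_category]

lexicon = {"man": "noun", "ran": "verb", "mat": "noun"}
candidates = ["man", "ran", "mat"]   # e.g. from ambiguous acoustic evidence

# Weak: after "the ...", syntax merely prunes the verb reading.
nouns = {w for w, cat in lexicon.items() if cat == "noun"}
print(weak_interaction(candidates, nouns.__contains__))   # ['man', 'mat']

# Strong: syntax instructs the lexicon to look only for nouns.
print(strong_interaction("noun", lexicon))                # ['man', 'mat']
```

The outputs coincide here, but the information flow differs: in the weak case the lexical alternatives exist before syntax sees them, whereas in the strong case syntax determines what the lexicon produces.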
<Paragraph position="6"> The model currently being developed embodies a weak hierarchical interaction, since this seems most promising on both psychological and computational grounds.</Paragraph>
<Paragraph position="7"> Unlike existing hierarchical or associative models, it uses a uniform global data structure, a &quot;chart&quot;. Associated with this structure is the active chart parser.</Paragraph>
<Paragraph position="8"> The active chart parser consists of the following:
1) A uniform global data structure (the Chart), which represents competing pathways through a search space, at different levels of description and at different stages of analysis. Complete descriptions are marked by &quot;inactive&quot; paths, called edges, spanning temporally defined portions of the utterance. These inactive edges carry pointers to the lower-level descriptions which support them. Partial descriptions are marked by &quot;active&quot; edges, which carry representations of the data needed to complete them. For example, a syntactic edge, such as a noun phrase, may span any complete descriptions that partially support it, such as a determiner or an adjective. In addition, it will carry a description of the syntactic properties (e.g. noun) which any inactive lexical edge must have to count both as additional evidence for this syntactic description and as justification for its extension or completion. The type and complexity of the descriptions are determined by the rule-based knowledge systems used by the parser, not by the parser itself.</Paragraph>
<Paragraph position="9"> 2) A multi-level task queueing structure (the Agenda), which is used to order the ways in which descriptions will be extended, through time and level of abstraction, and thus to control the size and direction of the search. This ordering on the Agenda is controlled by specifically designed search strategies, which determine the minimum amount of search compatible with a low rate of error in description. The power and flexibility of this approach in tackling complex system-building tasks is well set out in Bobrow et al. (1976).</Paragraph>
<Paragraph position="10"> 3) An algorithm which automatically schedules additions to the Chart onto the Agenda for subsequent processing wherever such extensions are possible, that is, whenever a description which is complete at some level (an inactive edge) can be used to extend a partial description at some higher level (an active edge). The knowledge bases, not the parser, define what extensions are possible.</Paragraph>
<Paragraph position="11"> To summarize, the chart is used to represent and extend pathways through a search space, across time and levels of abstraction. Within the chart there are different types of path, corresponding to different levels of description, each of which is associated with a particular knowledge source. To the extent that knowledge-specific rules specify what counts as a constituent pathway at each level of abstraction, a hierarchical flow of information is maintained. The weak interaction arises because alternative pathways at one level of description can be filtered out through attempts to build pathways at the next &quot;higher&quot; level. This model differs from straightforward hierarchical models, but resembles associative models, in that knowledge sources contribute to processing without each source necessarily corresponding to a distinct stage of analysis in the processing sequence.</Paragraph>
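The machinery just described can be made concrete in a minimal sketch. It assumes a simplified rule format, time indices rather than string positions, and multiplicative goodness-of-fit scores; none of these choices, nor the class names, are specified by the paper, so this is an illustrative reading rather than the proposed implementation.

```python
# A minimal sketch of a chart with active/inactive edges and an agenda
# ordered by priority score.  Edge, Chart, and the scoring rule are
# illustrative assumptions.
import heapq
import itertools

_tick = itertools.count()   # tie-breaker so the heap never compares edges

class Edge:
    """A pathway over the time interval [start, end) at one level of description."""
    def __init__(self, label, start, end, needed=(), found=(), score=1.0):
        self.label = label            # e.g. "NP", "noun", or a phone symbol
        self.start, self.end = start, end
        self.needed = tuple(needed)   # categories still required; empty => inactive
        self.found = tuple(found)     # pointers to supporting lower-level edges
        self.score = score            # goodness of fit of the evidence so far

    def inactive(self):
        return not self.needed        # a complete description

class Chart:
    def __init__(self):
        self.edges = []
        self.agenda = []              # heap of (-priority, tick, active, inactive)

    def add(self, edge):
        """Enter an edge and schedule every extension it makes possible."""
        self.edges.append(edge)
        for other in self.edges:
            if edge.inactive() and not other.inactive():
                self._schedule(other, edge)
            elif other.inactive() and not edge.inactive():
                self._schedule(edge, other)

    def _schedule(self, active, inactive):
        # An inactive edge may extend an active edge that ends where it
        # starts and whose next needed category it supplies.
        if active.end == inactive.start and active.needed[0] == inactive.label:
            priority = active.score * inactive.score
            heapq.heappush(self.agenda, (-priority, next(_tick), active, inactive))

    def step(self):
        """Carry out the highest-priority pending extension on the agenda."""
        _, _, active, inactive = heapq.heappop(self.agenda)
        self.add(Edge(active.label, active.start, inactive.end,
                      needed=active.needed[1:],
                      found=active.found + (inactive,),
                      score=active.score * inactive.score))

chart = Chart()
chart.add(Edge("NP", 0, 0, needed=("det", "noun")))  # active syntactic hypothesis
chart.add(Edge("det", 0, 1, score=0.9))              # inactive lexical evidence
chart.step()   # the NP edge now spans [0, 1) and needs only a "noun"
```

Note that the parser itself knows nothing of syntax or phonetics: the rule systems supply the labels and the `needed` categories, which is what keeps the components separable.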
<Paragraph position="12"> Having sketched the construction of the search space, we must now decide upon a strategy for exploring it. Most current psychological theories appear to assume strict &quot;left-to-right&quot; processing, although this requires immediately tackling stretches of sound which are of poor acoustic quality and which are relatively unconstrained by higher-level knowledge.</Paragraph>
<Paragraph position="13"> The majority of systems developed during the ARPA project found it necessary to use later-occurring information to disambiguate earlier parts of an utterance. Moreover, there is psycholinguistic evidence that the &quot;intelligibility&quot; of a particular stretch of sound increases with additional evidence from later, &quot;rightward&quot; stretches of sound (Pollack & Pickett, 1963; Warren & Warren, 1970). We propose to adopt a system using a form of left-to-right analysis which could approximate the power of the middle-out analysis used in HWIM and Hearsay II, but without requiring the construction of distinct &quot;islands&quot; and with less computational expense. This more precise method of using &quot;right-context effects&quot; depends on the priority scores assigned to paths. Such scores can be thought of, for present purposes, as some measure of &quot;goodness of fit&quot;. The score on a spanning pathway (that is, a pathway which spans other pathways &quot;beneath&quot; it) is determined by the scores on its constituents, and so is partly determined by scores towards its right-hand end. By virtue of affecting the &quot;spanning score&quot;, a score on one sub-path can affect the probability that another sub-path to its left (as well as to its right) will finally be chosen as the best description of the acoustic segment it represents. We will use psycholinguistic techniques to interrogate the &quot;expert&quot; (i.e. statistically reliable experiments with human listeners), in order to determine both when such leftward-flowing information is most often used for the disambiguation of poor-quality areas, and which sets of paths it will affect. It will be extremely useful to know whether people regularly rely on information from the right to disambiguate preceding stretches of sound, or whether this happens only at the beginnings of utterances, as the HWIM strategy suggests.</Paragraph>
<Paragraph position="14"> Pollack and Pickett claim that a word's position within a stimulus has no effect on its intelligibility, but unfortunately they offer no inferential statistics to back this claim.</Paragraph>
<Paragraph position="15"> This is only one of many issues in speech recognition which are experimentally addressable. The results of such experiments are obviously of relevance to computational systems, since they can indicate where and when sources of information are most likely to contribute towards the identification of an utterance. Conversely, the attempt to build a working model of at least some parts of the process will highlight many areas where further experimental data are needed.</Paragraph>
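Before concluding, the right-context effect on spanning scores can be illustrated with a worked example. The words and numbers are invented, and the multiplicative combination of sub-path scores is an assumption carried over from the sketch above, not a rule stated in this paper.

```python
# How rightward evidence can re-rank an earlier, acoustically poor stretch.
# Scores and lexical items are invented for illustration.

# Two competing descriptions of the early stretch of sound:
local_score = {"wreck": 0.55, "rec": 0.45}    # "wreck" wins on local evidence

# A later, clearer stretch ("...ognise") supports each continuation
# differently once a spanning pathway must incorporate it:
continuation = {"wreck": 0.20, "rec": 0.90}   # "rec" + "ognise" = "recognise"

spanning = {w: local_score[w] * continuation[w] for w in local_score}
best = max(spanning, key=spanning.get)
print(spanning)   # {'wreck': 0.11, 'rec': 0.405}
print(best)       # 'rec': the rightward evidence has overturned the earlier choice
```

In chart terms, neither left sub-path is discarded when first built; the leftward flow of influence emerges only from the scores on the spanning pathways that compete to cover the whole utterance.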
<Paragraph position="16"> Concluding Remark
We hope that this sketch of part of the proposed system has given a feel for the combined approach taken here. It developed through a re-examination of a number of issues which arose during the ARPA speech project, and a reconsideration of those issues in the light of recent computational and psycholinguistic advances. Given the success of these advances in the contributing fields of research, we feel that the time is right for the evaluation of a speech recognition system along the lines laid down here.</Paragraph> </Section> </Section> </Paper>