<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1076">
  <Title>Speech and Text-Image Processing in Documents</Title>
  <Section position="4" start_page="376" end_page="376" type="metho">
    <SectionTitle>
3. AUDIO EDITING AND INDEXING
</SectionTitle>
    <Paragraph position="0"> A second example of signal-based document processing is provided by a wordspotter developed to support editing and indexing of documents which originate, and are intended to remain, in audio form\[12\]. Examples include voice mail, dictated instructions and pre-recorded radio broadcasts or commentaries. The wordspotter can also be used to retrieve relevant portions of less structured audio, such as recorded lectures or telephone messages. In contrast with most previous wordspotting applications (e.g., \[15, 10\]), unconstrained keyword vocabularies are critical to such editing and indexing tasks. (Footnote 2: Syntax and semantics are not necessarily preserved in the example, owing to the user's lack of familiarity with most of the languages involved.)</Paragraph>
    <Paragraph position="1"> The wordspotter is similar to Image Emacs in at least three ways: 1) it is based on partial modeling of signal content; 2) it requires user specification of keyword models; and 3) it makes no explicit use of linguistic knowledge during recognition, though users are free to assign interpretations to keywords.</Paragraph>
    <Paragraph position="2"> The wordspotter is also speaker-dependent or, more accurately, sound-source-dependent. These constraints allow for vocabulary and language independence, as well as for the spotting of non-speech audio sounds.</Paragraph>
    <Paragraph position="3"> The wordspotter is based on hidden Markov models (HMMs) and is trained in two stages\[13\]. The first is a static stage, in which a short segment of the user's speech (typically 1 minute or less) is used to create a background, or non-keyword, HMM. The second stage of training is dynamic, in that keyword models are created while the system is in use. Model specification requires only a single repetition of a keyword and, thus, to the system user, is indistinguishable from keyword spotting. Spotting is performed using an HMM network consisting of a parallel connection of the background model and the appropriate keyword model. A forward-backward search\[13\] is used to identify keyword start and end times, both of which are required to enable editing operations such as keyword insertion and deletion.</Paragraph>
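The parallel connection of background and keyword models described above can be sketched as a small dynamic-programming decoder. This is an illustrative reconstruction, not the authors' implementation: the frame log-likelihoods, the keyword-entry probability p_enter, and the single-keyword state layout are all assumed for the example, and the actual system uses the forward-backward search cited in \[13\] rather than plain Viterbi.

```python
import numpy as np

def spot_keyword(frame_ll_bg, frame_ll_kw, p_enter=0.01):
    """Viterbi decoding over a parallel background/keyword HMM network.

    frame_ll_bg : (T,) log-likelihood of each frame under the background model
    frame_ll_kw : (T, K) log-likelihoods under K left-to-right keyword states
    Returns (start, end) frame indices of the spotted keyword, or None.
    """
    T, K = frame_ll_kw.shape
    NEG = -1e30
    log_enter, log_stay = np.log(p_enter), np.log(1.0 - p_enter)
    # State 0 is the background model; states 1..K form the keyword model.
    delta = np.full(K + 1, NEG)
    delta[0] = frame_ll_bg[0]
    back = np.zeros((T, K + 1), dtype=int)
    for t in range(1, T):
        new = np.full(K + 1, NEG)
        bp = np.zeros(K + 1, dtype=int)
        # Background: self-loop, or re-entry from the last keyword state.
        new[0], bp[0] = max((delta[0] + log_stay, 0), (delta[K], K))
        new[0] += frame_ll_bg[t]
        # First keyword state: entered from background, or self-loop.
        new[1], bp[1] = max((delta[0] + log_enter, 0), (delta[1], 1))
        new[1] += frame_ll_kw[t, 0]
        # Remaining keyword states: left-to-right (self-loop or advance).
        for k in range(2, K + 1):
            new[k], bp[k] = max((delta[k], k), (delta[k - 1], k - 1))
            new[k] += frame_ll_kw[t, k - 1]
        delta, back[t] = new, bp
    # Backtrace from the best final state to recover the state sequence;
    # frames assigned to keyword states give the start and end times.
    s = int(np.argmax(delta))
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()
    hits = [t for t, st in enumerate(path) if st > 0]
    return (hits[0], hits[-1]) if hits else None
```

Because both start and end times fall out of the backtrace, edit operations such as keyword deletion reduce to cutting the audio between the two returned frame indices.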
    <Paragraph position="4"> The audio editor is implemented on a Sun Microsystems Sparcstation and makes use of its standard audio hardware. A videotape demonstrating its use in several multilingual application scenarios is available from SIGGRAPH\[14\].</Paragraph>
  </Section>
  <Section position="5" start_page="376" end_page="377" type="metho">
    <SectionTitle>
4. DOCUMENT IMAGE DECODING
</SectionTitle>
    <Paragraph position="0"> Document image decoding (DID) is a framework for scanned document recognition which extends hidden Markov modeling concepts to two-dimensional image data\[4\]. In analogy with the HMM approach to speech recognition, the decoding framework assumes a communication theory model based on three elements: an image generator, a noisy channel and an image decoder. The image generator consists of a message source, which generates a symbol string containing the information to be communicated, and an imager, which formats or encodes the message into an ideal bitmap. The channel transforms this ideal image into a noisy observed image by introducing distortions associated with printing and scanning.</Paragraph>
    <Paragraph position="1"> The decoder estimates the message from the observed image using maximum a posteriori decoding and a Viterbi-like decoding algorithm\[6\].</Paragraph>
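The same MAP principle can be written down for a one-dimensional analogue: a Markov message source observed through a memoryless noisy channel. The symbol sets, transition probabilities, and channel model below are invented for illustration; the paper's actual decoder operates over two-dimensional image data with a far richer source model.

```python
import numpy as np

def map_decode(obs, log_init, log_trans, log_emit):
    """Viterbi-style MAP decoding: argmax over messages m of P(m) P(obs | m).

    obs       : observed symbol indices, length T
    log_init  : (S,) log P(m_0)            -- source start distribution
    log_trans : (S, S) log P(m_t | m_t-1)  -- the message source
    log_emit  : (S, O) log P(obs_t | m_t)  -- the noisy channel
    Returns the MAP message as a list of source-symbol indices.
    """
    S = log_init.shape[0]
    T = len(obs)
    delta = log_init + log_emit[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans          # (prev state, cur state)
        back[t] = np.argmax(cand, axis=0)          # best predecessor per state
        delta = cand[back[t], np.arange(S)] + log_emit[:, obs[t]]
    # Backtrace the best-scoring message.
    s = int(np.argmax(delta))
    msg = [s]
    for t in range(T - 1, 0, -1):
        s = int(back[t, s])
        msg.append(s)
    return msg[::-1]
```

With a source that strongly favors staying in the same symbol and a channel that flips symbols 10% of the time, the decoder corrects an isolated flip in the observation, which is exactly the sense in which the source prior cleans up channel noise.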
    <Paragraph position="2">  A key objective being pursued within the DID framework is the automatic generation of optimized decoders from explicit models of message source, imager and channel\[3\]. The goal is to enable application-oriented users to specify such models without sophisticated knowledge of image analysis and recognition techniques. It is intended that both the format and type of information returned by the document decoder be under the user's control. The basic approach is to support declarative specification of a priori document information (e.g., page layout, font metrics) and task constraints via formal stochastic grammars.</Paragraph>
    <Paragraph position="3"> Figures 3 through 5 illustrate the application of DID to the extraction of subject headings, listing types, business names and telephone numbers from scanned yellow pages\[7\]. A slightly reduced version of a sample scanned yellow page column is shown in Figure 3 and a finite-state top-level source model in Figure 4. The yellow page column includes a subject heading and examples of several different listing types. These, in turn, are associated with branches of the source model. The full model contains more than 6000 branches and 1600 nodes.</Paragraph>
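To suggest what such a finite-state source model looks like in declarative form, the fragment below encodes a toy column grammar as a branch table and checks whether a sequence of field labels is generable from it. The state names, labels, and branches are invented for illustration and are far simpler than the 6000-branch model described above.

```python
def accepts(branches, labels, start="column", final="end"):
    """Check whether a label sequence is generable by a finite-state source.

    branches : list of (src_state, dst_state, output_label) triples;
               a None output is an epsilon branch that emits nothing.
    """
    def closure(states):
        # Follow epsilon branches until no new states are reachable.
        states = set(states)
        changed = True
        while changed:
            changed = False
            for src, dst, out in branches:
                if out is None and src in states and dst not in states:
                    states.add(dst)
                    changed = True
        return states

    states = closure({start})
    for lab in labels:
        states = closure({dst for src, dst, out in branches
                          if src in states and out == lab})
        if not states:
            return False
    return final in states

# Hypothetical branch table: a column is a subject heading followed by
# any number of listings of various types.
branches = [
    ("column", "body", "SubjectHeading"),
    ("body",   "body", "PlainListing"),
    ("body",   "body", "BoldListing"),
    ("body",   "end",  None),
]
```

In the full framework each branch would also carry a probability and a bitmap template, so the same table that defines what the decoder returns also defines how the page image is scored.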
    <Paragraph position="4"> Figure 5 shows the result of using a decoder generated from the model to extract the desired information from the yellow page column. Automatically generated decoders have also been used to recognize a variety of other document types, including dictionary entries, musical notation and baseball box scores.</Paragraph>
  </Section>
  <Section position="6" start_page="377" end_page="377" type="metho">
    <SectionTitle>
5. SUMMARY
</SectionTitle>
    <Paragraph position="0"> The speech and text-image recognition initiatives discussed in the preceding sections illustrate two research themes at Xerox PARC which expand and redefine the role of recognition technology in document-oriented applications. These include the development of editors which operate directly on audio and scanned image data, and the use of speech and text-image recognition to retrieve arbitrary information from documents with signal content. Key concepts embodied in these research efforts include partial document models, task-oriented document recognition, user specification and interpretation of recognition models, and automatic generation of recognizers from declarative models. These concepts enable the realization of a broad range of signal-based document processing operations, including font, vocabulary and language-independent editing and retrieval.</Paragraph>
  </Section>
  <Section position="7" start_page="377" end_page="378" type="metho">
    <SectionTitle>
References
</SectionTitle>
    <Paragraph position="0"> 1. Bagley, S. and Kopec, G. &amp;quot;Editing images of text&amp;quot;. Technical Report P92-000150, Xerox PARC, Palo Alto, CA, November, 1992.</Paragraph>
    <Paragraph position="1"> 2. Horn, B. Robot Vision, The MIT Press, Cambridge, MA, 1986. 3. Kopec, G. and Chou, P. &amp;quot;Automatic generation of custom document image decoders&amp;quot;. Submitted to ICDAR'93: Second IAPR Conference on Document Analysis and Recognition, Tsukuba Science City, Japan, October, 1993.</Paragraph>
    <Paragraph position="2"> 4. Kopec, G. and Chou, P. &amp;quot;Document image decoding using Markov sources&amp;quot;. To be presented at ICASSP-93, Minneapolis, MN, April 1993.</Paragraph>
    <Paragraph position="3"> 5. Kopec, G. &amp;quot;Least-squares font metric estimation from images&amp;quot;. EDL Report EDL-92-008, Xerox PARC, Palo Alto, CA, July, 1992.</Paragraph>
    <Paragraph position="4"> 6. Kopec, G. &amp;quot;Row-major scheduling of image decoders&amp;quot;. Submitted to IEEE Trans. on Image Processing, February, 1992. 7. Pacific Bell Smart Yellow Pages, Palo Alto, Redwood City and Menlo Park, 1992.</Paragraph>
    <Paragraph position="5"> 8. Price, P. &amp;quot;Evaluation of Spoken Language Systems: The ATIS Domain&amp;quot;. Proc. Third DARPA Speech and Language Workshop, P. Price (ed.), Morgan Kaufmann, June 1990.</Paragraph>
    <Paragraph position="6"> 9. Price, P., Fisher, W., Bernstein, J. and Pallett, D. &amp;quot;The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition&amp;quot;. Proceedings of ICASSP-88, 1988, pp 651-654.</Paragraph>
    <Paragraph position="7"> 10. Rohlicek, R., Russell, W., Roukos, S. and Gish, H. &amp;quot;Continuous hidden Markov modeling for speaker-independent word spotting&amp;quot;. Proceedings of ICASSP-89, 1989, pp 627-630.</Paragraph>
    <Paragraph position="8"> 11. Stallman, R. GNU Emacs Manual, Free Software Foundation, Cambridge, MA, 1986.</Paragraph>
    <Paragraph position="9"> 12. Wilcox, L. and Bush, M. &amp;quot;HMM-based wordspotting for voice editing and audio indexing&amp;quot;. Proceedings of Eurospeech-91, Genova, Italy, 1991, pp 25-28.</Paragraph>
    <Paragraph position="10"> 13. Wilcox, L. and Bush, M. &amp;quot;Training and search algorithms for an interactive wordspotting system&amp;quot;. Proceedings of ICASSP-92, 1992, pp II-97 - II-100.</Paragraph>
    <Paragraph position="11"> 14. Wilcox, L., Smith, I. and Bush, M. &amp;quot;Wordspotting for voice editing and indexing&amp;quot;. Proceedings of CHI '92, 1992, pp 655-656. Video on SIGGRAPH Video Review 76-77.</Paragraph>
    <Paragraph position="12"> 15. Wilpon, J., Miller, L., and Modi, P. &amp;quot;Improvements and applications for key word recognition using hidden Markov modeling techniques&amp;quot;. Proceedings of ICASSP-91, 1991, pp 309-312.</Paragraph>
  </Section>
</Paper>