<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1098">
  <Title>Annotation-Based Multimedia Summarization and Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multimedia Annotation
</SectionTitle>
    <Paragraph position="0"> Multimedia annotation is an extension of document annotation such as GDA (Global Document Annotation) (Hasida, 2002). Since natural language text is more tractable and meaningful than binary data of visual (image and movingpicture)andauditory(soundandvoice) content, we associate text withmultimedia content in several ways. Since most video clips contain spoken narrations, our system converts them into text and integrates them into video annotation data. The text in the multimedia annotation is linguistically annotated based on GDA.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Multimedia Annotation Editor
</SectionTitle>
      <Paragraph position="0"> We developed an authoring tool called multi-media annotation editor capable of video scene change detection, multilingual voice transcription, syntactic and semantic analysis of transcripts, and correlation of visual/auditory segments and text.</Paragraph>
      <Paragraph position="1">  windows. One window(top) shows a video content, automatically detected keyframes in the video, and an automatically generated voice transcript. The second window (left bottom) enablestheuserto editthetranscriptandmodifyanautomatically analyzed linguistic markup structure. The third window (right bottom) shows graphically a linguistic structure of the selected sentence in the second window.</Paragraph>
      <Paragraph position="2"> The editor is capable of basic natural language processing and interactive disambiguation. The user can modify the results of the automatically analyzed multimedia and linguistic (syntactic and semantic) structures.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Linguistic Annotation
</SectionTitle>
      <Paragraph position="0"> Linguistic annotation has been used to make digitaldocumentsmachine-understandable,and todevelopcontent-basedpresentation,retrieval, question-answering, summarization, and translation systems with much higher quality than is currently available. We have employed the GDA tagset as a basic framework to describe linguistic and semantic features of documents.</Paragraph>
      <Paragraph position="1"> The GDA tagset is based on XML (Extensible Markup Language) (W3C, 2002), and designed to be as compatible as possible with TEI (TEI, 2002), CES (CES, 2002), and EAGLES (EA-GLES, 2002).</Paragraph>
      <Paragraph position="2"> An example of a GDA-tagged sentence follows: null</Paragraph>
      <Paragraph position="4"> The &lt;su&gt; element is a sentential unit. The other tags above, &lt;n&gt;, &lt;np&gt;, &lt;v&gt;, &lt;ad&gt; and &lt;adp&gt; mean noun, noun phrase, verb, adnoun or adverb (including preposition and postposition), and adnominal or adverbial phrase, respectively. null The opr attribute encodes a relationship in which the current element stands with respect to the element that it semantically depends on. Its value denotes a binary relation, which may be a thematic role such as agent, patient, recipient, etc., or a rhetorical relation such as cause, concession, etc. For instance, in the above sentence, &lt;np opr=&amp;quot;agt&amp;quot; sem=&amp;quot;time0&amp;quot;&gt;Time&lt;/np&gt; depends on the second element &lt;v sem=&amp;quot;fly1&amp;quot;&gt;flies&lt;/v&gt;. opr=&amp;quot;agt&amp;quot; means that Time has the agent role with respect to the event denoted by flies. The sem attribute encodes a word sense.</Paragraph>
      <Paragraph position="5"> Linguistic annotation is generated by automatic morphological analysis, interactive sentence parsing, and word sense disambiguation by selecting the most appropriate item in the domain ontology. Some research issues on linguistic annotation are related to how the annotation cost can be reduced within some feasible levels. We have been developing some machineguided annotation interfaces to simplify the annotation work. Machine learning mechanisms also contribute to reducing the cost because they can gradually increase the accuracy of automatic annotation.</Paragraph>
      <Paragraph position="6"> In principle, the tag set does not depend on language, but as a first step we implemented a semi-automatic tagging system for English and Japanese.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Video Annotation
</SectionTitle>
      <Paragraph position="0"> The linguistic annotation technique has an important role in multimedia annotation. Our video annotation consists of creation of text data related to video content, linguistic annotation of the text data, automatic segmentation of video, semi-automatic linking of video segments with corresponding text data, and interactive naming of people and objects in video scenes.</Paragraph>
      <Paragraph position="1"> To be more precise, video annotation is performed through the following three steps.</Paragraph>
      <Paragraph position="2"> First, for each video clip, the annotation system creates the text corresponding to its content. We developed a method for creation of voice transcripts using speech recognition engines. It is called multilingual voice transcription and described later.</Paragraph>
      <Paragraph position="3"> Second, some video analysis techniques are applied to characterization of visual segments (i.e., scenes) and individual video frames. For example, bydetecting significant changes in the color histogram of successive frames, frame sequences can be separated into scenes.</Paragraph>
      <Paragraph position="4"> Also, by matching prepared templates to individual regions in the frame, the annotation system identifies objects. The user can specify significant objects in some scene in order to reduce the time to identify target objects and to obtain a higher recognition accuracy. The user can nameobjects ina frame simplybyselecting words in the corresponding text.</Paragraph>
      <Paragraph position="5"> Third,theuserrelates video segments to text segments such as paragraphs, sentences, and phrases, based on scene structures and objectname correspondences. The system helps the user select appropriate segments by prioritizing them based on the number of the detected objects, camera motion, and the representative frames.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Multilingual Voice Transcription
</SectionTitle>
      <Paragraph position="0"> The multimedia annotation editor first extracts the audio data from a target video clip. Then, the extracted audio data isdividedinto left and right channels. If the average for the difference of the audio signals of the two channels exceeds a certain threshold, they are considerd different  andtransferedtothemultilingualspeechidentification and recognition module. The output of the module is a structured transcript containing time codes, word sequences, and language information. It is described in XML format as shown in Figure 2.</Paragraph>
      <Paragraph position="1"> &lt;transcript lang=&amp;quot;en&amp;quot; channel=&amp;quot;l&amp;quot;&gt;</Paragraph>
      <Paragraph position="3"> Ourmultilingual video transcriptor automatically generates transcriptswith time codes and provides their reusable data structure which allows easy manual corretion. An example screen of the mulitilingual voice transcriptor is shown in Figure 3.</Paragraph>
      <Paragraph position="4">  and Recognition The progress of speech recognition technology makes it comparatively easy to transform speech into text, but spoken language identification is needed for processing multilingual speech, because speech recognition technology assumes that the language used is known.</Paragraph>
      <Paragraph position="5"> While researchers have been working on the multilingual speech identification, few applicationsbasedonthistechnologyhasbeenactually null used except a telephony speech translation system. In the case of the telephone translation system, the information of the language used is self-evident; at least, the speaker knows; so  therearelittleneedsandadvantagesofdeveloping a multilingual speech identification system. On the other hand, speech data in video do not always have the information about the languageused. Duetotherecentprogressofdigital broadcasting and the signal compression technology, the information about the language is expected to accompany the content in the future. But most of the data available now do nothave it, soalarge amountoflaborisneeded to identify the language. Therefore, the multi-lingual speech identification has a large part to play with unknown-language speech input.</Paragraph>
      <Paragraph position="6"> Aprocessofmultilingualspeechidentification is shown in Figure 4. Our method determines the language of input speech usinga simple discriminant function based on relative scores obtainedfrommultiplespeechrecognizersworking null in parallel (Ohira et al., 2001).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Identification Unit
</SectionTitle>
      <Paragraph position="0"> Multiple speech recognition engines work simultaneously on the input speech. It is assumed that each speech recognition engine has  thespeakerindependentmodel,andeachrecognitionoutputwordhasascorewithinaconstant null range dependent on each engine.</Paragraph>
      <Paragraph position="1"> When a speech comes, each recognition engine outputs a word sequence with scores. The discriminant unit calculates a value of a discriminant function using the scores for every language. The engine with the highest average discriminant value is selected and the language is determined by the engine, whose recognition result is accepted as the transcript. If there is no distinct difference between discriminant values, that is not higher than a certain threshold, a judgment is entrusted to the user.</Paragraph>
      <Paragraph position="2"> Our technique is simple, it uses the existing speech recognition engines tuned in each language without a special model for language identification and acoustic features.</Paragraph>
      <Paragraph position="3"> Combining the voice transcription and the video image analysis, our tool enables users to create and edit video annotation data semiautomatically. The entire process is as shown in Figure 5.</Paragraph>
      <Paragraph position="4">  Our system drastically reduces the overhead on the user who analyzes and manages a large collection of video content. Furthermore, it makesconventional naturallanguageprocessing techniques applicable to multimedia processing.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Scene Detection and Visual Object
Tracking
</SectionTitle>
      <Paragraph position="0"> As mentioned earlier, visual scene changes are detected by searching for significant changes in the color histogram of successive frames. Then, framesequencescanbedividedintoscenes. The scene description consists of time codes of the start and end frames, a keyframe (image data in JPEG format) filename, a scene title, and some text representing topics. Additionally, when the user specifies a particular object in a frame by mouse-dragging a rectangular region, an automatic object tracking is executed and timecodesandmotiontrailsintheframe(series of coordinates for interpolation of object movement) are checked out. The user can name the detected visualobjects interactively. Thevisual objectdescriptionincludestheobjectname,the relatedURL,timecodesandmotiontrailsinthe frame.</Paragraph>
      <Paragraph position="1"> Our multimedia annotation also contains descriptionsonauditoryobjects invideo. Theauditory objects can bedetected by acoustic analysisontheuserspecifiedsoundsequencevisual- null izedinwaveform. Anexamplescenedescription inXML format isshowninFigure 6, andanexample object description in Figure 7.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Multimedia Summarization and
Translation
</SectionTitle>
    <Paragraph position="0"> Based on multimedia annotation, we have developed a system for multimedia (especially,  ate an interactive HTML (HyperText Markup Language) document from multimedia content with annotation data for interactive multimediapresentation, whichconsistsofanembedded video player, hyperlinked keyframe images, and linguistically-annotated transcripts. Our summarization and translation techniques are applied to the generated document called a multi-modal document.</Paragraph>
    <Paragraph position="1"> There are some previous work on multimedia summarization such as Informedia (Smith and Kanade, 1995) and CueVideo (Amir et al., 1999). They create a video summary based on automatically extracted features in video such as scene changes, speech, text and human faces in frames, and closed captions. They can process video data without annotations. However, currently, the accuracy of their summarization is not for practical use because of the failure of automaticvideoanalysis. Ourapproachtomultimediasummarizationattainssufficientquality null  foruseifthedatahasenoughsemanticinformation. As mentioned earlier, we have developed a tool to help annotators to create multimedia annotation data. Since our annotation data is declarative, hence task-independent and versatile, the annotations are worth creating if the multimedia content will be frequently used in different applications such as automatic editing and information extraction.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Multimodal Document
</SectionTitle>
      <Paragraph position="0"> Video transformation is an initial process of multimedia summarization and translation.</Paragraph>
      <Paragraph position="1"> The transformation module retrieves the annotationdataaccumulatedinanannotationrepos- null itory (XML database) and extracts necessary information to generate a multimodal document. The multimodal document consists of an embedded video window, keyframes of scenes, and transcipts aligned withthe scenes as shown in Figure 8. The resulting document can be summarized and translated by the modules explained later.</Paragraph>
      <Paragraph position="2">  This operation is also beneficial for people with devices without video playing capability. In this case, the system creates a simplifiedversion ofmultimodal documentcontaining only keyframe images of important scenes and summarized transcripts related to the selected scenes.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Video Summarization
</SectionTitle>
      <Paragraph position="0"> The proposed video summarization is performed as a by-product of text summarization. The text summarization is an application of linguistic annotation. The method is cohesion-based and employs spreading activation to calculate the importance values of wordsand phrasesinthe document(Nagao and Hasida, 1998).</Paragraph>
      <Paragraph position="1"> Thus, the video summarization works in terms of summarization of a transcript from multimedia annotation data and extraction of the video scene related to the summary. Since a summarized transcript contains important words and phrases, corresponding video sequences will produce a collection of significant scenes in the video. The summarization results in a revised version of multimodal documemt that contains keyframe images and summarized transcripts of selected important scenes. Keyframes of less important scenes are shown in a smaller size. An example screen of a summarized multimodal document is shown in Fig- null The vertical time bar in the middle of the screenofmultimodaldocumentrepresentsscene segments whose color indicates if the segment is included in the summary or not. The keyframe images are linked with their corresponding scenes so that the user can see the scene by just clicking its related image. The user can also access information about objects such as people in the keyframe by dragging a rectangular region enclosing them. The informationappearsinexternalwindows. Inthecase of auditory objects, the user can select them by clicking any point in the time bar.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Video Translation
</SectionTitle>
      <Paragraph position="0"> One type of our video translation is achieved through the following procedure. First, transcripts in the annotation data are translated into different languages for the user choice, and then, the results are shown as subtitles synchronized with the video. The video translation module invokes an annotation-based text translation mechanism. Text translation is also greatly improved by using linguistic annotation (Watanabe et al., 2002).</Paragraph>
      <Paragraph position="1"> The other type of translation is performed in terms of synchronization of video playing and speech synthesis of the translation results. This translation makes another-language version of the original video clip. If comments, notes, or keywords are included in the annotation data on visual/auditory objects, then they are also translated and shown on a popup window.</Paragraph>
      <Paragraph position="2"> In the case of bilingual broadcasting, since our annotation system generates transcripts in everyaudiochannel,multimodaldocumentscan be coming from both channels. The user can easily select a favorite multimodal document created from one of the channels. We have also developed a mechanism to change the language to play depending on the user profile that describes the user's native language.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML