Towards a Framework for Learning Structured Shape Models from Text-Annotated Images

1 Introduction

Researchers in computer vision and computational linguistics share the goal of automatically associating semantic information with the visual or linguistic representations they extract from an image or text. Given paired image and text data, one approach is to use the visual and linguistic representations as implicit semantics for each other -- that is, to use the words as names for the visual features, and the image objects as referents for the words in the text (cf. Roy, 2002). The goal of our work is to automatically acquire structured object models from image data associated with text, while at the same time learning an assignment of text labels for objects as well as for their subparts (and, in the long run, for collections of objects). [Footnote: Wachsmuth is supported by the German Research Foundation (DFG). Stevenson and Dickinson gratefully acknowledge the support of NSERC of Canada.]

Multimodal datasets that contain both images and text are ubiquitous, including annotated medical images, the Corel dataset, and, of course, the World Wide Web, making it possible to associate textual and visual information in this way. For example, if a web crawler encountered many images containing a particular shape, and also found the word chair in the captions of those images, it might associate the shape with the word chair, simultaneously providing a name for the shape and a visual "definition" for the word. Such a framework could then learn the class names for a set of shape classes, effectively yielding a translation model between image shapes (or, more generally, image features) and words (Duygulu et al., 2002). This translation model could in turn be used to answer many types of queries, including labeling a new image in terms of its visible objects, or generating a visual prototype for a given class name. Furthermore, since figure captions (or image annotations in general) may contain words both for entire objects and for their component parts, a natural semantic hierarchy may emerge from the words. For example, just as tables in the image may be composed of "leg" image parts, the word leg can be associated with the word table in a part-whole relation.
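To make the notion of such a translation model concrete, the sketch below trains a Model 1 style lexical table t(word | region) with EM over discrete shape tokens, loosely in the spirit of Brown et al. (1993) and Duygulu et al. (2002). The toy corpus, the shape labels, and the helper name train_translation_model are all illustrative assumptions; a real model would operate over richer region features rather than opaque tokens.

```python
from collections import defaultdict

def train_translation_model(corpus, iterations=10):
    """Model 1 style EM for a lexical table t(word | region).
    `corpus` is a list of (regions, words) pairs, one per image; regions
    and words are lists of discrete tokens, purely for illustration."""
    vocab = {w for _, words in corpus for w in words}
    uniform = 1.0 / len(vocab)
    t = defaultdict(lambda: uniform)          # t[(word, region)]

    for _ in range(iterations):
        count = defaultdict(float)            # expected word-region counts
        total = defaultdict(float)            # normaliser per region
        for regions, words in corpus:
            for w in words:
                z = sum(t[(w, r)] for r in regions)
                for r in regions:
                    p = t[(w, r)] / z         # responsibility of r for w
                    count[(w, r)] += p
                    total[r] += p
        # M-step: renormalise; unseen pairs fall back to the uniform value.
        t = defaultdict(lambda: uniform,
                        {(w, r): c / total[r] for (w, r), c in count.items()})
    return t

# Toy captioned images: shape tokens paired with caption words.
corpus = [(["top-shape", "leg-shape"], ["table", "leg"]),
          (["back-shape", "leg-shape"], ["chair", "leg"])]
t = train_translation_model(corpus)
print(max(["table", "leg", "chair"], key=lambda w: t[(w, "leg-shape")]))  # -> leg
```

Even on this toy corpus, the shared "leg-shape" token ends up most strongly associated with the word leg, because it co-occurs with that word in both captions.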
Others have explored the problem of learning associations between image regions (or features) and text, including Barnard and Forsyth (2001), Duygulu et al. (2002), Blei and Jordan (2002), and Cascia et al. (1998). As impressive as these results are, the approaches make limiting assumptions that keep them from meeting our goal of a structured object model. On the vision side, each segmented region is mapped one-to-one or one-to-many to words. Conceptually, associating a word with only one region prevents an appropriate treatment of objects with parts, since such an object may consistently be segmented into a collection of regions corresponding to its parts.

Practically, even putting aside the goal of part-whole processing, in real images any given region may be (incorrectly) oversegmented into a set of subregions that are not component parts. Barnard et al. (2003) propose a ranking scheme for potential merges of regions based on a model of word-region association, but do not address the creation of a structured object model from sequences of merges. To address these issues, we propose a more elaborate translation/association model in which the text of the image captions guides the structuring of the regions.

On the language side of this task, words have typically been treated individually, with no semantic structure among them (though see Roy, 2002, which induces syntactic structure among the words). Multiple words may be assigned as the label of a region, but there is no knowledge of the relations among the words (and in fact they may be treated as interchangeable labels; Duygulu et al., 2002). The more restrictive goal of image labeling has put the focus on the image as the (structured) object. We instead take an approach that, in principle, builds a structured hierarchy for both the image objects and their text labels. In this way, we aim not only to use the words to guide the interpretation of image regions, but also to use the image structure to induce a part-whole hierarchy among the words. For example, suppose we find that consistently associated leg and top regions are together referred to as a table. Then, instead of treating leg and table as two labels for the same object, we could capture the image's part-whole structure as word relations in our lexicon.

Our goal of inducing associated structured hierarchies of visual and linguistic descriptions is a long-term one, and this paper reports on our work thus far. We start with the probabilistic translation model of Brown et al. (1993) (as in Duygulu et al., 2002) and extend it to structured shape descriptions of visual data. As alluded to earlier, we distinguish between two types of structured shape descriptions: collections of regions that should be merged due to oversegmentation, and collections of regions that represent components of an object. To handle both types, we incorporate into our algorithm several region merge operations that iteratively evaluate potential merges in terms of their improvement to the translation model.

These operations can exploit probabilities over region adjacency, thus constraining the potential combinatorial explosion of possible region merges. We also permit a many-to-many mapping between regions and words, in support of our goal of inducing structured text as well, although here we report only on the structured image model, assuming that similar mechanisms will be useful on the text side.
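One way to picture these merge operations is the illustrative greedy loop below (a sketch of the general idea, not the exact procedure): candidate merges are restricted to pairs of adjacent regions, each candidate is scored by the change in the translation model's log-likelihood after retraining, and the best improving merge is applied. It reuses the hypothetical train_translation_model helper from the earlier sketch; the adjacency table and merge_pair helper are likewise assumptions made for the example.

```python
import math

def corpus_log_likelihood(corpus, t):
    """Model 1 style log-likelihood of the caption words given the regions
    (the constant alignment prior is dropped)."""
    ll = 0.0
    for regions, words in corpus:
        for w in words:
            ll += math.log(sum(t[(w, r)] for r in regions) / len(regions))
    return ll

def merge_pair(corpus, img, pair):
    """Return a copy of `corpus` in which the two regions in `pair` are
    replaced by one composite region in image `img`. A real merge would
    also combine the underlying shape descriptions."""
    regions, words = corpus[img]
    merged = [r for r in regions if r not in pair] + ["+".join(sorted(pair))]
    return corpus[:img] + [(merged, words)] + corpus[img + 1:]

def greedy_adjacent_merges(corpus, adjacency, max_merges=10):
    """Apply the best improving merge until none helps. `adjacency` maps an
    image index to pairs of region labels that touch in that image, so only
    adjacent regions are ever considered; in practice one would also weight
    candidates by adjacency probabilities or penalise model complexity."""
    for _ in range(max_merges):
        base = corpus_log_likelihood(corpus, train_translation_model(corpus))
        best = None
        for img, pair in ((i, p) for i in adjacency for p in adjacency[i]):
            if not set(pair) <= set(corpus[img][0]):
                continue                      # a region was already merged away
            candidate = merge_pair(corpus, img, pair)
            gain = corpus_log_likelihood(
                candidate, train_translation_model(candidate)) - base
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, candidate)
        if best is None:
            break                             # no merge improves the model
        corpus = best[1]
    return corpus

# Toy usage with the corpus from the previous sketch:
# adjacency = {0: [("top-shape", "leg-shape")], 1: [("back-shape", "leg-shape")]}
# merged_corpus = greedy_adjacent_merges(corpus, adjacency)
```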
We are currently developing a system to demonstrate our proposal. The input to the system is a set of images, each segmented into regions that are organized into a region adjacency graph. Nodes in the graph encode the qualitative shape of a region using a shock graph (Siddiqi et al., 1999), while undirected edges represent region adjacency (used to constrain possible merges). On the text side, each image has an associated caption, which is processed by a part-of-speech tagger (Brill, 1994) and chunker (Abney, 1991). The result is a set of noun phrases (nouns with associated modifiers), which may or may not pertain to image content. The output of the system is a set of many-to-many (possibly structured) associations between image regions and text words.

This paper represents work in progress, and not all of the components have been fully integrated. Initially, we have focused on the issues involved in building the structured image models. We demonstrate the ideas on a set of annotated synthetic scenes containing both multi-part objects and oversegmented objects and parts. The results show that, at least on simple scenes, the model can cope with oversegmentation and converge to a set of meaningful many-to-many mappings between regions and words.
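To fix notation for the input and output representations just described, the sketch below defines a toy region adjacency graph whose nodes carry an opaque string in place of a true shock-graph descriptor, along with caption noun phrases that are assumed to have already been produced by a tagger and chunker. The class and field names (Region, ImageExample, and so on) are illustrative, not part of any existing implementation.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Region:
    rid: int
    shape: str            # stand-in for a shock-graph shape descriptor

@dataclass
class ImageExample:
    regions: dict[int, Region] = field(default_factory=dict)
    adjacency: set[frozenset[int]] = field(default_factory=set)  # undirected edges
    noun_phrases: list[str] = field(default_factory=list)        # from the caption

    def add_region(self, region: Region) -> None:
        self.regions[region.rid] = region

    def connect(self, a: int, b: int) -> None:
        # Only regions joined by an adjacency edge are candidates for merging.
        self.adjacency.add(frozenset((a, b)))

# A toy captioned image: a table seen as a top region and a leg region.
img = ImageExample(noun_phrases=["wooden table", "table leg"])
img.add_region(Region(0, "rectangular-blob"))
img.add_region(Region(1, "elongated-blob"))
img.connect(0, 1)

# The desired output is a many-to-many association between (possibly merged)
# regions and words, e.g. the merged region {0, 1} with "table" and the
# single region {1} with "leg".
```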