<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0604">
  <Title>Towards a Framework for Learning Structured Shape Models from Text-Annotated Images</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Visual Shape Description
</SectionTitle>
    <Paragraph position="0"> In order to learn structured visual representations, we must be able to make meaningful generalizations over image regions that are sufficiently similar to be treated as equivalent. The key lies in determining categorical shape classes whose definitions are invariant to within-class shape deformation, color, texture, and part articulation. In previous work, we have explored various generic shape representations, and their application to generic object recognition (Siddiqi et al., 1999; Shokoufandeh et al., 2002) and content-based image retrieval (Dickinson et al., 1998). Here we draw on our previous work, and adopt a view-based 3-D shape representation, called a shock graph, that is invariant to minor shape deformation, part articulation, translation, rotation, and scale, along with minor rotation in depth.</Paragraph>
    <Paragraph position="1"> The vision component consists of a number of steps. First, the image is segmented into regions, using the mean-shift region segmentation algorithm of Comaniciu and Meer (1997).1 The result is a region adjacency graph, in which nodes represent homogeneous 1The results presented in Section 4.2 are based on a synthetic region segmentation. When working with real images, we plan to use the mean-shift algorithm, although any region segmentation algorithm could conceivably be used.</Paragraph>
    <Paragraph position="2">  computed shock points of a 2-D closed contour; and (c) the resulting shock graph.</Paragraph>
    <Paragraph position="3"> regions, and edges capture region adjacency. The parameters of the segmentation algorithm can be set so that it typically errs on the side of oversegmentation (regions may be broken into fragments), although undersegmentation is still possible (regions may be merged incorrectly with their neighbors). Next, the qualitative shape of each region is encoded by its shock graph (Siddiqi et al., 1999), in which nodes represent clusters of skeleton points that share the same qualitative radius function, and edges represent adjacent clusters (directed from larger to smaller average radii). As shown in Figure 1(a), the radius function may be: 1) monotonically increasing, reflecting a bump or protrusion; 2) a local minimum, monotonically increasing on either side of the minimum, reflecting a neck-like structure; 3) constant, reflecting an elongated structure; or 4) a local maximum, reflecting a disk-like or blob-like structure. An example of a 2-D shape, along with its corresponding shock graph, is shown in Figures 1(b) and (c).</Paragraph>
    <Paragraph position="4"> The set of all regions from all training images are clustered according to a distance function that measures the similarity of two shock graphs in terms of their structure and their node attributes. As mentioned above, the key requirement of our shape representation and distance is that it be invariant to both within-class shape deformation as well as image transformation. We have developed  a matching algorithm for 2-D shape recognition. As illustrated in Figure 2, the matcher can compute shock graph correspondence between different exemplars belonging to the same class.</Paragraph>
    <Paragraph position="5"> During training, regions are compared to region (shape) class prototypes. If the distance to a prototype is small, the region is added to the class, and the prototype recomputed as that region whose sum distance to all other class members is minimum. However, if the distance to the nearest prototype is large, a new class and prototype are created from the region. Using the region adjacency graph, we can also calculate the probability that two prototypes are adjacent in an image. This is typically a very large, yet sparse, matrix.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning of Translation Models
</SectionTitle>
    <Paragraph position="0"> The learning of translation models from a corpus of bilingual text has been extensively studied in computational linguistics. Probabilistic translation models generally seek to find the translation string e that maximizes the probability Pra5 ea6fa7 , given the source string f (where f referred to French and e to English in the original work, Brown et al., 1993). Using Bayes rule and maximizing the numerator, the following equation is obtained:</Paragraph>
    <Paragraph position="2"> The application of Bayes rule incorporates Pra5 ea7 into the formula, which takes into account the probability that ^e is a correct English string.</Paragraph>
    <Paragraph position="3"> Pra5 fa6ea7 is known as the translation model (prediction of f from e), and Pra5 ea7 as the language model (probabilities over e independent of f). Like others (Duygulu et al., 2002), we will concentrate on the translation model; taking f as the words in the text and e as the regions in the images, we thus predict words from image regions. However, we see the omission of the language model component, Pra5 ea7 (in our case, probabilities over the &amp;quot;language&amp;quot; of images--i.e., over &amp;quot;good&amp;quot; region associations), as a shortcoming. Indeed, as we see below, we insert some simple aspects of a &amp;quot;language model&amp;quot; into our current formulation, i.e. using the region adjacency graph to restrict possible merges, and using the a priori probability of a region Pra5 ra7 if translating from words to regions. In future work, we plan to elaborate the Pra5 ea7 component more thoroughly.</Paragraph>
    <Paragraph position="4"> Data sparseness prevents the direct estimation of Pra5 fa6 ea7 (which predicts one complete sequence of symbols from another), so practical translation models must make independence assumptions to reduce the number of parameters needed to be estimated. The first model of Brown et al. (1993), which will be used and expanded in our initial formulation, uses the following approximation</Paragraph>
    <Paragraph position="6"> where M is the number of French words in f, L is the number of English words in e, and a is an alignment that maps each French word to one of the English words, or to the &amp;quot;null&amp;quot; word e0. Pra5 Ma7a15a8 e is constant and Pra5 a j a6 La7a16a8 1a17a18a5 La19 1a7 depends only on the number of English words. The conditional probability of f j depends only on its own alignment to an English word, and not on the translation of other words fi. These assumptions lead to the following formulation, in which ta5 f j a6 ea j a7 defines a translation table from English words to French words:</Paragraph>
    <Paragraph position="8"> To learn such a translation between image objects and text passages, it is necessary to: 1) Define the vocabulary of image objects; 2) Extract this vocabulary from an image; 3) Extract text that describes an image object; 4) Deal with multiple word descriptions of objects; and 5) Deal with compound objects consisting of parts. Duygulu et al. (2002) assume that all words (more specifically, all nouns) are possible names of objects.</Paragraph>
    <Paragraph position="9"> Each segmented region in an image is characterized by a 33-dimensional feature vector. The vocabulary of image objects is defined by a vector quantization of this feature space. In the translation model of Brown et al., Duygulu et al. (2002) substitute the French string f by the sequence w of caption words, and the English string e by the sequence r of regions extracted from the image (which they refer to as blobs, b). They do not consider multiple word sequences describing an image object, nor image objects that consist of multiple regions (oversegmentations or component parts).</Paragraph>
    <Paragraph position="10"> In section 2 we argued that many object categories are better characterized by generic shape descriptions rather than finite sets of appearance-based features. However, in moving to a shape-based representation, we need to deal with image objects consisting of multiple regions (cf. Barnard et al., 2003). We distinguish three different types of multiple region sets:  1. Type A (accidental): Region over-segmentation due to illumination effects or exemplar-specific markings on the object that results in a collection of subregions that is not generic to the object's class.</Paragraph>
    <Paragraph position="11"> 2. Type P (parts): Region over-segmentation common to many exemplars of a given class that results in a collection of subregions that may represent meaningful parts of the object class. In this case, it is assumed that on some occasions, the object is seen as a silhouette, with no over-segmentation into parts.</Paragraph>
    <Paragraph position="12"> 3. Type C (compound): Objects that are always seg null mented into their parts (e.g., due to differently colored or textured parts). This type is similar to Type P, except that these objects never appear as a whole silhouette. (Our mechanism for dealing with these objects will also allow us, in the future, to handle conventional collections of objects, such as a set of chairs with a table.) We can extend the one-to-one translation model in Eqn. (3) above by grouping or merging symbols (in this case, regions) and then treating the group as a new symbol to be aligned. Theoretically, then, multiple regions can be handled in the same translation framework, by adding to the sequence of regions in each image, the regions resulting from all possible merges of image regions:</Paragraph>
    <Paragraph position="14"> where ~L denotes the total number of segmented and merged regions in an image. However, in practice this causes complexity and stability problems; the number of possible merges may be intractable, while the number of semantically meaningful merges is quite small.</Paragraph>
    <Paragraph position="15"> Motivated by the three types of multiple region sets described above, we have instead developed an iterative bootstrapping strategy that filters hypothetically meaningful merges and adds these to the data set. Our method proceeds as follows:  1. As in Dyugulu et al., we calculate a translation  model t0a5 wa6 ra7 between words and regions, using a data set of N image/caption pairs D a8a22a21a23a5 wd</Paragraph>
    <Paragraph position="17"> mented region in image d.</Paragraph>
    <Paragraph position="18">  2. We next account for accidental over-segmentations (Type A above) by adding all merges to the data set that increase the score based on the old translation model:</Paragraph>
    <Paragraph position="20"> That is, we use the current translation model to determine whether to merge any two adjacent regions into a new region. If the quality of the translation is improved by the merge, we add the new region to r.</Paragraph>
    <Paragraph position="21"> If the dataset was extended by any number of new regions, the algorithm starts again with step 1 and recalculates the translation model.</Paragraph>
    <Paragraph position="22"> 3. We then account for regular over-segmentation (Type P above) by extending the number of regions merged for adjacent region sets--i.e., merges are no longer restricted to be pairwise. In this step, though, only sets of regions that frequently appear together in images are candidates for merging. Again, those that increase the score are iteratively added to the data set until the data set is stable.</Paragraph>
    <Paragraph position="23"> 4. For compound objects (Type C above), the score criterion does not apply because the silhouette of the merged structure does not appear in the rest of the data set. Since the current translation model has no information about the whole object, merging the component regions cannot increase the quality of the translation model.</Paragraph>
    <Paragraph position="24"> Instead, we develop a new scoring criterion, based on Melamed (1997). First, the current translation model is used to induce translation links between words and regions, and the mutual information of words and regions is calculated, using the link counts for the joint distribution. Next, the increase in mutual information is estimated for a hypothetical data set Da35 in which the regions of potential compounds are merged. If a compound contributes to an increase in mutual information in Da35 , then the merge is added to our data set.</Paragraph>
    <Paragraph position="25"> 5. The sequence of steps above is repeated until no new regions are added to the data set.</Paragraph>
    <Paragraph position="26"> In our algorithm above, we mapped our three approaches to dealing with region merges to the three types of multiple regions sets identified earlier (Types A, P, C). Indeed, each step in the algorithm is inspired by the corresponding type of region set; however, each step may apply to other types. For example, in a given data set, the legs of a table may only infrequently be segmented into separate regions, so that a merge to form a table may occur in step 2 (Type A) instead of step 3 (Type P). Thus, the actual application of steps 2-4 depends on the precise make-up of regions and their frequencies in the data set. In our demonstration system reported next, step 3 of the algorithm is currently applied without considering how frequent a region pair appears. It iteratively generates three pairwise merges, with the output restricted to those that yield a shape seen before. We expect that considering only frequent shape pairs will stabilize merging effects and reduce computational complexity for more expensive merge operations than on the synthetic dataset. Our implementation of step 4 is in the early stage and currently considers combinations of any two regions, whether adjacent or not. This causes problems for images with more than one object or additional background shapes.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Demonstration
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Scene Generation
</SectionTitle>
      <Paragraph position="0"> As this paper represents work in progress, we have only tested our model on synthetic scenes with captions.</Paragraph>
      <Paragraph position="1"> Scenes contain objects composed of parts drawn from a small vocabulary of eight shapes, numbered 1-8, in Figure 3. (Our shapes are specified in terms of qualitative relationships over lines and curves; precise angle measurement is not important, Dickinson et al., 1992.) To simulate undersegmentation, primitive parts may be grouped into larger regions; for example, an object composed of three parts may appear as a single silhouette, representing the union of the three constituent parts. To simulate oversegmentation, four of the shape primitives (1, 5, 6, 8) can appear according to a finite set of oversegmentation models, as shown in Figure 3. To add ambiguity, over-segmentation may yield a subshape matching one of the shape categories (e.g., primitive shape 5, the trapezoidal shape, can be decomposed into shapes 1 and 4) or, alternatively, matching a subshape arising from a different oversegmentation. For example, the shape in the bottom right of Figure 3 is decomposed into two parts, one of which (25, representing two parallel lines bridged at one end by a concave curve and at the other end by a line) occurs in a different oversegmentation model (in this case, the oversegmentation shown immediately above it).</Paragraph>
      <Paragraph position="2"> Scenes are generated containing one or two objects, drawn from a database of six objects (two chairs, two tables, and two lamps, differing in their primitive decompositions), shown in Figure 4. Given an object model, a decomposition grammar (i.e., a set of rewrite rules) is automatically generated that takes the silhouette of the shape and decomposes it into pieces that are either: 1) unions of the object's primitive parts, representing an undersegmentation of the object; 2) the object's primitive parts; or 3) oversegmentations of the object's primitive parts. In addition, the scene can contain up to four background shapes, drawn from Figure 3. These shapes introduce ambiguity in the mapping from words to objects in the scene, and can participate in merges of regions in our algorithm. Finally, each scene has an associated text caption that contains one word for each database object, which specifies either the name of the whole object (table/stand, chair/stool, lamp/light), or a part of the ob- null struct objects in the scene. Below: The various ways in which four of the shapes (1, 5, 6, 8) can be oversegmented. null ject (base or leg). Just as the scene contains background shapes, the caption may contain up to four &amp;quot;background&amp;quot; words that have nothing to do with the objects (or primitive parts) in the database.</Paragraph>
      <Paragraph position="3"> We have developed a parameterized, synthetic scene generator that uses the derived rules to automatically generate scenes with varying degrees of undersegmentation, oversegmentation, ambiguous background objects, and extraneous caption words. Although no substitute for testing the model on real images, it has the advantage of allowing us to analyze the behavior of the framework as a function of these underlying parameters. Examples of input scenes it produces are shown in Figure 5.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The first experiment we report here (Exp. 1) tests our ability to learn a translation model in the presence of Type A and Type P segmentation errors. We generated 1000 scenes with the following parameters: 1 or 2 objects per image, forced oversegmentation to a depth of 4, maximum 4 background shapes, one relevant word (part or whole descriptor), and maximum 2 meaningless random words per image. Table 1 shows the translation tables (Pra5 wa6 ra7 ) for this dataset, stopping the algorithm after step 1 (no merging) and after step 3. For all of the objects, the merging step increased the probability of one word, and decreased the probability of the others, creating a stronger word-shape association. For 5 of the objects, the highest probability word is a correct identifier of the object (stand, chair, stool, light, lamp), and for the  other object, a word indicating a part of the object has high probability (leg for the first table object).</Paragraph>
      <Paragraph position="1"> Although increasing the strength of one probability has an advantage, we need to explore ways to allow association of more than one &amp;quot;whole object&amp;quot; word (such as lamp and light) with a single object (cf. Duygulu et al., 2002).</Paragraph>
      <Paragraph position="2"> Since we maintain the component regions of a merged region, having both a part and a whole word, such as leg and table, associated with the same image is not a problem. Incorporating these into a structured word hierarchy should help to focus associations appropriately.</Paragraph>
      <Paragraph position="3"> Another way to view the data is to see which shapes are most consistently associated with the meaningful words in the captions. Here we calculate Pa5 ra6 wa7 by Pra5 wa6 ra7 Pra5 ra7 , with the latter normalized over all shapes. A problem with this formulation is that, due to the Pra5 ra7 component, high frequency shapes can increase the probability of primitive components. However, the merging steps (2 and 3) of our algorithm raise the frequencies of complex (multi-region) shapes. Table 2 shows the five shapes with the highest values for each meaningful word, again before and after the merging steps in Exp. 1. Sev-</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> for the meaningful words, after step 4. Shape icons (for merged regions) or primitive shapes (indicated by number) have the probability for that word listed below.</Paragraph>
      <Paragraph position="8"> eral complex shapes increase in probability after merging, and a number of new complex shapes appear in the lists.</Paragraph>
      <Paragraph position="9"> We report on one other experiment (Exp. 2) which was designed to test our approach to handling oversegmentations of Type C in step 4 of our algorithm. Our dataset again had 1000 images; here there was only one object per image, but every object was oversegmented into its primitive parts (that is, an object never appeared as a complete silhouette). (We did not allow oversegmentation of the primitives here, nor did we include irrelevant words in the captions.) Because our 6 objects never appear &amp;quot;whole,&amp;quot; steps 2 and 3 of our algorithm cannot apply; before step 4, words are associated with primitive shapes only. After step 4, the highest probability word (Pra5 wa6 ra7 ) for 4 of the objects is a correct identifier of the object (stand, chair, stool, light); for one object, a word indicating a part of the object had high probability (leg for the rectangular table). (One object silhouette--the second lamp--was not fully reconstructed.) Table 3 shows the five shapes with the highest Pra5 ra6wa7 values for each meaningful word, after step 4. For 3 of the whole object words (stand, stool, light), and both part words (leg, base), the best shape is a correct one. For the remaining whole object words (table, chair, lamp), a correct full silhouette is one of the top five. Step 4 clearly has high potential for reconstructing objects that are consistently oversegmented into their parts.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML