<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0608">
  <Title>Why can't José read? The problem of learning semantic associations in a robot environment</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A contextual translation model for object recognition
</SectionTitle>
    <Paragraph position="0"> In this paper, we cast object recognition as a machine translation problem, as originally proposed in (Duygulu et al., 2002). Essentially, we translate patches (regions of an image) into words. The model acts as a lexicon: a dictionary that predicts one representation (words) given another representation (patches). First we introduce some notation, and then we build a story for our proposed probabilistic translation model.</Paragraph>
    <Paragraph position="1"> We consider a set of N images paired with their captions. Each training example n is composed of a set of patches fbn1;:::;bnMng and a set of words fwn1;:::;wnLng. Mn is the number of patches in image n and Ln is the number of words in the image's caption. Each bnj Rnf is a vector containing a set of feature values representing colour, texture, position, etc, wherenf is the number of features. For each patch bnj, our objective is to align it to a word from the attached caption. We represent this unknown association by a variable anj, such that ainj = 1 if bnj translates to wni; otherwise, ainj = 0. Therefore, p(ainj) , p(anj = i) is the probability that patch bnj is aligned with word wni in document n. See Figure 2 for an illustration. nw is the total number of word tokens in the training set.</Paragraph>
    <Paragraph position="2"> We construct a joint probability over the translation parameters and latent alignment variables in such a way that maximizing the joint results in what we believe should be the best object recognition model (keeping in mind the limitations placed by our set of features!). Without loss of generality, the joint probability is</Paragraph>
    <Paragraph position="4"> where wn denotes the set of words in the nth caption, an;1:j 1 is the set of latent alignments 1 to j 1 in image n, bn;1:j 1 is the set of patches 1 to j 1, and is the set of model parameters.</Paragraph>
    <Paragraph position="5"> Generally speaking, alignments between words and patches depend on all the other alignments in the image, simply because objects are not independent of each other. These dependencies are represented explicitly in equation 1. However, one usually assumes</Paragraph>
    <Paragraph position="7"> guarantee tractability. In this paper, we relax the independence assumption in order to exploit spatial context in images and words. We allow for interactions between neighbouring image annotations through a pairwise Markov random field (MRF). That is, the probability of a patch being aligned to a particular word depends on the word assignments of adjacent patches in the image. It is reasonable to make the assumption that given the alignment for a particular patch, translation probability is independent from the other patch-word alignments. A simplified version of the graphical model for illustrative purposes is shown in Figure 3.</Paragraph>
    <Paragraph position="8">  document. The shaded circles are the observed nodes (i.e. the data). The white circles are unobserved variables of the model parameters. Lines represent the undirected dependencies between variables. The potential controls the consistency between annotations, while the potentials nj represent the patch-to-word translation probabilities.</Paragraph>
    <Paragraph position="9"> In Figure 3, the potentials nj , p(bnjjw?) are the patch-to-word translation probabilities, where w? denotes a particular word token. We assign a Gaussian distribution to each word token, so p(bnjjw?) = N(bnj; w?; w?). The potential (anj;ank) encodes the compatibility of the two alignments, anj and ank.</Paragraph>
    <Paragraph position="10"> The potentials are the same for each image. That is, we use a single W W matrix , where W is the number of word tokens. The final joint probability is a product of the translation potentials and the inter-alignment potentials:</Paragraph>
    <Paragraph position="12"> where w?(wni) = 1 if the ith word in the nth caption is the word w?; otherwise, it is 0.</Paragraph>
    <Paragraph position="13"> To clarify the unsupervised model described up to this point, it helps to think in terms of counting wordto-patch alignments for updating the model parameters. Loosely speaking, we update the translation parameters w? and w? by counting the number of times particular patches are aligned with word w?. Similarly, we update (w?;w ) by counting the number of times the word tokensw? andw are found in adjacent patch alignments. We normalize the latter count by the overall alignment frequency to prevent counting alignment frequencies twice.</Paragraph>
    <Paragraph position="14"> In addition, we use a hierarchical Bayesian scheme to provide regularised solutions and to carry out automatic feature weighting or selection (Carbonetto et al., 2003). In summary, our learning objective is to find good values for the unknown model parameters , f ; ; ; g, where and are the means and covariances of the Gaussians for each word, is the set of alignment potentials and is the set of shrinkage hyper-parameters for feature weighting. For further details on how to compute the model parameters using approximate EM and loopy belief propagation, we refer the reader to (Carbonetto et al., 2003; Carbonetto and de Freitas, 2003)</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation metric considerations
</SectionTitle>
    <Paragraph position="0"> Before we discuss what makes a good evaluation metric, it will help if we answer this question: &amp;quot;what makes a good image annotation?&amp;quot; As we will see, there is no  importance of concepts has little or no relation to the area these concepts occupy. On the left, &amp;quot;polar bear&amp;quot; is at least pertinent as &amp;quot;snow&amp;quot; even though it takes up less area in the image. In the photograph on the right, &amp;quot;train&amp;quot; is most likely the focus of attention.</Paragraph>
    <Paragraph position="1"> It is fair to say that certain concepts in an image are more prominent than others. One might take the approach that objects that consume the most space in an image are the most important, and this is roughly the evaluation criterion used in previous papers (Carbonetto et al., 2003; Carbonetto and de Freitas, 2003). Consider the image on the left in Figure 4. We claim that &amp;quot;polar bear&amp;quot; is at least as important as snow. There is an easy way to test this assertion - pretend the image is annotated either entirely as &amp;quot;snow&amp;quot; or entirely as &amp;quot;polar bear&amp;quot;. In our experience, people find the latter annotation as appealing, if not more, than the former. Therefore, one would conclude that it is better to weight all concepts equally, regardless of size, which brings us to the image on the right. If we treat all words equally, having many words in a single label obfuscates the goal of getting the most important concept,&amp;quot;train&amp;quot;, correct.</Paragraph>
    <Paragraph position="2"> Ideally, when collecting user-annotated images for the purpose of evaluation, we should tag each word with a weight to specify its prominence in the scene. In practice, this is problematic because different users focus their attention on different concepts, not to mention the fact that it is an burdensome task.</Paragraph>
    <Paragraph position="3"> For lack of a good metric, we evaluate the proposed translation models using two error measures. Error measure 1 reports an error of 1 if the model annotation with the highest probability results in an incorrect patch annotation. The error is averaged over the number of patches in each image, and then again over the number of images in the data set. Error measure 2 is similar, only we average the error over the patches corresponding to word (according to the manual annotations). The equations are  where Pni is the set of patches in image n that are manually-annotated using word i, banj is the model alignment with the highest probability, ~anj is the provided &amp;quot;true&amp;quot; annotation, and ~anj (banj) is 1 if ~anj = banj. Our intuition is that the metric where we weight all concepts equally, regardless of size, is better overall. As we will see in the next section, our translation models do not perform as well under this error measure. This is due to the fact that the joint probability shown in equation 1 maximises the first error metric, not the second. Since the agent cannot know the true annotations beforehand, it is difficult to construct a model that maximises the second error measure, but we are currently pursuing approximations to this metric.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We built a data set by having Jos'e the robot roam around the lab taking pictures, and then having laboratory members create captions for the data using a consistent set of words. For evaluation purposes, we manually annotated the images. The robomedia data set is composed of 107 training images and 43 test images 1. The training and test sets contain a combined total of 21 word tokens. The word frequencies in the labels and manual annotations are shown in figure 5.</Paragraph>
    <Paragraph position="1"> In our experiments, we consider two scenarios. In the first, we use Normalized Cuts (Shi and Malik, 1997) to segment the images into distinct patches. In the second scenario, we take on the object recognition task without the aid of a sophisticated segmentation algorithm, and instead construct a uniform grid of patches over the image. Examples of different segmentations are shown along with the anecdotal results in Figure 8. For the crude segmentation, we used patches of height and width approximately 1=6th the size of the image. We found that smaller patches introduced too much noise to the features and resulted in poor test performance, and larger patches contained too many objects at once. In future work, we  finding a particular word in a label and a manually annotated patch, in the robomedia training and test sets. The final two columns show the precision of the translation model tMRF using the grid segmentation for each token, averaged over the 12 trials. Precision is defined as the probability the model's prediction is correct for a particular word and patch. Since precision is 1 minus the error of equation 3, the total precision on both the training and test sets matches the average performance of tMRF-patch on Error measure 2, as shown in in Figure 7. While not presented in the table, the precision on individual words varies significantly from one one trial to the next. Note that some words do not appear in both the training and test sets, hence the n/a.</Paragraph>
    <Paragraph position="2"> yThe model predicts words without access to the test image labels. We provide this information for completeness. zWe can use the manual annotations for evaluation purposes, but we underline the fact that an agent would not have access to the information presented in the &amp;quot;Annotation %&amp;quot; column.</Paragraph>
    <Paragraph position="3"> will investigate a hierarchical patch representation to take into account both short and long range patch interactions, as in (Freeman and Pasztor, 1999).</Paragraph>
    <Paragraph position="4"> We compare two models. The first is the translation model where dependencies between alignments are removed for the sake of tractability, called tInd. The second is the translation model in which we assume dependences  and manual segmentations. When there are multiple annotations in a single patch, any one of them is correct. Even when both are correct, the grid segmentation is usually more precise and, as a result, more closely approximates generic object recognition.</Paragraph>
    <Paragraph position="5"> between adjacent alignments in the image. This model is denoted by tMRF. We represent the sophisticated and crude segmentation scenarios by -seg and -patch, respectively. null One admonition regarding the evaluation procedure: a translation is deemed correct if at least one of the patches corresponds to the model's prediction. In a manner of speaking, when a segment encompasses several concepts, we are giving the model the benefit of the doubt. For example, according to our evaluation the annotations for both the grid and Normalized Cuts segmentations shown in Figure 6 correct. However, from observation the grid segmentation provides a more precise object recognition.</Paragraph>
    <Paragraph position="6"> As a result, evaluation can be unreliable when Normalized Cuts offers poor segmentations. It is also important to remember that the true result images shown in the second column of Figure 8 are idealisations.</Paragraph>
    <Paragraph position="7"> Experimental results on 12 trials are shown in Figure 7, and selected annotations predicted by the tMRF model on the test set are shown in Figure 8. The most significant result is that the contextual translation model performs the best overall, and performs equally well when supplied with either Normalized Cuts or a naive segmentations. We stress that even though the models trained using both the grid and Normalized Cuts segmentations are displayed on the same plots, in Figure 6 we indicate that object recognition using the grid segmentation is generally more precise, given the same evaluation result in Figure 7. Learning contextual dependencies between alignment appears to improve performance, despite the large amount of noise and the increase in the number of model parameters that have to be learned. The contex- null Whisker plot. The middle line of a box represents the median. The central box represents the values from the 25 to 75 percentile, using the upper and lower statistical medians. The horizontal line extends from the minimum to the maximum value, excluding outside and far out values which are displayed as separate points. The dotted line at the top is the random prediction upper bound. Overall, the contextual model tMRF is an improvement over the independent model, tInd. On average, tMRF tends to perform equally well using the sophisticated or naive patch segmentations. tual model also tends to produce more visually appealing annotations since they the translations smoothed over neighbourhoods of patches.</Paragraph>
    <Paragraph position="8"> The performance of the contextual translation model on individual words on the training and test sets is shown in Figure 5, averaged over the trials. Since our approximate EM training a local maximum point estimate for the joint posterior and the initial model parameters are set to random values, we obtain a great deal of variance from one trial to the next, as observed in the Box-and-Whisker plots in Figure 7. While not shown in Figure 5, we have noticed considerable variation in what words are predicted with high precision. For example, the word &amp;quot;ceiling&amp;quot; is predicted with an average success rate of 0:347, although the precision on individual trials ranges from 0 to 0:842.</Paragraph>
    <Paragraph position="9"> Figure 8: Selected annotations on the robomedia test data predicted by the contextual (tMRF) translation model. We show our model's predictions using both sophisticated and crude segmentations. The &amp;quot;true&amp;quot; annotations are shown in the second column. Notice that the annotations using Normalized Cuts tend to be more visually appealing compared to the rectangular grid, but intuition is probably misleading: the error measures in Figure 7 demonstrate that both segmentations produce equally accurate results. It is also important to note that these annotations are probabilistic; for clarity we only display results with the highest probability. From the Bayesian feature weighting priors placed on the word cluster means, we can deduce the relative importance of our feature set. In our experiments, luminance and vertical position in the image are the two most important features.</Paragraph>
  </Section>
class="xml-element"></Paper>