<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0608"> <Title>Why can't José read? The problem of learning semantic associations in a robot environment</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In writing this paper we hope to promote a discussion on the design of an autonomous agent that learns semantic associations in its environment or, more precisely, that learns to associate regions of images with discrete concepts. When an image region is labeled with a concept in an appropriate and consistent fashion, we say that the object has been recognised (Duygulu et al., 2002). We use our laboratory robot, José (Elinas et al., 2002), as a prototype, but the ideas presented here extend to a wide variety of settings and agents.</Paragraph>
<Paragraph position="1"> Before we proceed, we must elucidate the requirements for achieving semantic learning in an autonomous agent context.</Paragraph>
<Paragraph position="2"> Primarily, we need a model that learns associations between objects given a set of images paired with user input. Formally, the task is to find a function that separates the space of image patch descriptions into n_w semantic concepts, where n_w is the total number of concepts in the training set (from now on we use the word &quot;patch&quot; to refer to a contiguous region in an image). These supplied concepts could be in the form of text captions, speech, or anything else that might convey semantic information.</Paragraph>
[Figure 1 caption: José (Elinas et al., 2002), the mobile robot we used to collect the image data. The images on the right are examples the robot has captured while roaming in the lab, along with the labels used for training. We depict image region annotations in later figures, but we emphasize that the robot receives only the labels as input for training; that is, the robot does not know what words correspond to the image regions.]
<Paragraph position="4"> For the time being, we restrict the set of concepts to English nouns (e.g. &quot;face&quot;, &quot;toothbrush&quot;, &quot;floor&quot;). See Figure 1 for examples of images paired with captions composed of nouns. Despite this restriction, we still leave ourselves open to a great deal of ambiguity and uncertainty, in part because objects can be described at several different levels of specificity, and at the same level using different words (e.g. is it &quot;sea&quot;, &quot;ocean&quot;, &quot;wave&quot; or &quot;water&quot;?). Ideally, one would like to impose a hierarchy of lexical concepts, as in WordNet (Fellbaum, 1998). We have yet to explore WordNet for our proposed framework, though it has been used successfully for image clustering (Barnard et al., 2001; Barnard et al., 2002).</Paragraph>
<Paragraph position="5"> Image regions, or patches, are described by a set of low-level features such as the average and standard deviation of colour, average oriented Gabor filter responses to represent texture, and position in space. The set of patch descriptions forms an n_f-dimensional space of real numbers, where n_f is the number of features. Even complex low-level features are far from adequate for the task of classifying patches as objects -- at some point we need to move to representations that include high-level information. In this paper we take a small step in that direction, since our model learns spatial relations between concepts.</Paragraph>
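To make the patch representation concrete, the following Python sketch computes one plausible n_f-dimensional descriptor of this kind. It is a minimal illustration rather than the feature extractor used for José: the helper name describe_patch, the Gabor frequency, and the choice of four orientations are assumptions.

```python
import numpy as np
from skimage.filters import gabor

def describe_patch(image, mask):
    """image: H x W x 3 float array in [0, 1]; mask: H x W boolean array marking one patch."""
    ys, xs = np.nonzero(mask)
    pixels = image[mask]                               # n_pixels x 3 colour values inside the patch

    colour_mean = pixels.mean(axis=0)                  # average colour
    colour_std = pixels.std(axis=0)                    # standard deviation of colour

    grey = image.mean(axis=2)                          # greyscale image for texture features
    texture = []
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        response, _ = gabor(grey, frequency=0.3, theta=theta)
        texture.append(np.abs(response[mask]).mean())  # average oriented filter response in the patch

    h, w = mask.shape
    position = [ys.mean() / h, xs.mean() / w]          # normalised patch centroid

    # Concatenation gives a real-valued vector of fixed length n_f.
    return np.concatenate([colour_mean, colour_std, np.asarray(texture), position])
```

Stacking describe_patch outputs for every patch in every training image yields the real-valued feature space referred to above.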
<Paragraph position="6"> Given the uncertainty regarding descriptions of objects and their corresponding concepts, we further require that the model be probabilistic. In this paper we use Bayesian techniques to construct our object recognition model.</Paragraph>
<Paragraph position="7"> Implicitly, we need a thorough method for decomposing an image into conceptually contiguous regions. This is not only non-trivial, but also impossible without considering semantic associations. This motivates treating the segmentation of images and the learning of associations between patches and words as tightly coupled processes.</Paragraph>
<Paragraph position="8"> The subject of segmentation brings up another important consideration. A good segmentation algorithm such as Normalized Cuts (Shi and Malik, 1997) can take on the order of a minute to complete. For many real-time applications this is an unaffordable expense. It is important to abide by real-time constraints in the case of a mobile robot, since it has to simultaneously recognise and negotiate obstacles while navigating in its environment. Our experiments suggest that the costly step of a decoupled segmentation can be avoided without imposing a penalty on object recognition performance.</Paragraph>
<Paragraph position="9"> Autonomous semantic learning must be considered a supervised process or, as we will see later on, a partially supervised process, since the associations are made from the perspective of humans. This motivates a second requirement: a system for the collection of data, ideally in an on-line fashion. As mentioned above, user input could come in the form of text or speech. However, the collection of data for supervised classification is problematic and time-consuming for the user overseeing the autonomous agent, since the user is required to tediously feed the agent with self-annotated regions of images. If we relax our requirement on training data acquisition by requesting captions at the image level, not at the patch level, the acquisition of labeled data becomes much less challenging. Throughout this paper, we use manual annotations for testing only -- we emphasize that the training data includes only the labels paired with images.</Paragraph>
<Paragraph position="10"> We are no longer exploring object recognition as a strict classification problem, and we do so at a cost, since we are no longer blessed with the exact associations between image regions and nouns. As a result, the learning problem is now unsupervised. For a single training image and a particular word token, we must now learn both the probability of generating that word given an object description and the correct association to one of the regions within the image. Fortunately, there is a straightforward parallel between our object recognition formulation and the statistical machine translation problem of building a lexicon from an aligned bitext (Brown et al., 1993; Al-Onaizan et al., 1999). Throughout this paper, we reason about object recognition with this analogy in mind (Duygulu et al., 2002).</Paragraph>
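To flesh out the analogy, the sketch below estimates a word-blob lexicon with an IBM Model 1-style EM loop, treating each training image as an aligned bitext whose "source" tokens are vector-quantised patch descriptions (blobs) and whose "target" tokens are the caption words. This is only an illustration of the translation framing, not the contextual translation model proposed in Section 2, and the function and variable names are assumptions.

```python
from collections import defaultdict

def learn_lexicon(corpus, word_vocab, blob_vocab, n_iters=20):
    """corpus: list of (caption_words, image_blobs) pairs, each a list of tokens."""
    # Uniform initialisation of the translation table t(word | blob).
    t = {(w, b): 1.0 / len(word_vocab) for w in word_vocab for b in blob_vocab}

    for _ in range(n_iters):
        counts = defaultdict(float)   # expected (word, blob) co-occurrence counts
        totals = defaultdict(float)   # expected counts per blob (normaliser)

        # E-step: softly align each caption word to the blobs of its own image.
        for caption, blobs in corpus:
            for w in caption:
                z = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    p = t[(w, b)] / z          # responsibility of blob b for word w
                    counts[(w, b)] += p
                    totals[b] += p

        # M-step: re-estimate t(word | blob) from the expected counts.
        for b in blob_vocab:
            if totals[b] > 0:
                for w in word_vocab:
                    t[(w, b)] = counts[(w, b)] / totals[b]
    return t
```

After training, each caption word can be assigned to the region whose blob maximises t(word | blob), which is the kind of word-patch correspondence depicted in the figures; the model in Section 2 extends this picture by also exploiting spatial relations between concepts.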
<Paragraph position="11"> What other requirements should we consider? Since our discussion involves autonomous agents, we should pursue a dynamic data acquisition model. We can consider the problem of learning an object recognition model as an on-line conversation between the robot and the user, and it follows that the robot should be able to participate. If the agent ventures into &quot;unexplored territory&quot;, we would like it to make unprompted requests for more assistance.</Paragraph>
<Paragraph position="12"> One could use active learning to implement a scheme for requesting user input based on what information would be most valuable to classification. This has yet to be explored for object recognition, but it has been applied to the related domain of image retrieval (Tong and Chang, 2001). Additionally, the learning process could be coupled with reinforcement -- in other words, the robot could offer hypotheses for visual input and await feedback from the user.</Paragraph>
<Paragraph position="13"> In the next section, we outline our proposed contextual translation model. In Section 3, we weigh the merits of several different error measures for the purposes of evaluation. The experimental results on the robot data are given in Section 4. We leave discussion of results and future work to the final section of this paper.</Paragraph>
[Figure caption fragment: ...correspondences between label words and image patches. In this example, the correct association is a_{n2} = 4.]
</Section> </Paper>