<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0606">
  <Title>Learning Word Meaning and Grammatical Constructions from Narrated Video Events</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Visual Scenes and analysis
</SectionTitle>
    <Paragraph position="0"> For a given video sequence the visual scene analysis generates the corresponding event description in the format event(agent, object, recipient) .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Single Event Labeling
</SectionTitle>
      <Paragraph position="0"> Events are defined in terms of contacts between elements. A contact is defined in terms of the time at which it occurred, the agent, object, and duration of the contact. The agent is determined as the element that had a larger relative velocity towards the other element involved in the contact. Based on these parameters of contact, scene events are recognized as follows: Touch(agent, object): A single contact, in which (a) the duration of the contact is inferior to touch_duration (1.5 seconds), and (b) the object is not displaced during the duration of the contact.</Paragraph>
      <Paragraph position="1"> Push(agent, object): A single contact in which (a) the duration of the contact is superior or equal to touch_duration and inferior to take_duration (5 sec), (b) the object is displaced during the duration of the contact, and (c) the agent and object are not in contact at the end of the event.</Paragraph>
      <Paragraph position="2"> Take(agent, object) : A single contact in which (a) the duration of contact is superior or equal to take_duration , (b) the object is displaced during the contact, and (c) the agent and object remain in contact.</Paragraph>
      <Paragraph position="3"> Take(agent, object, source): Multiple contacts, as the agent takes the object from the source. For the first contact between the agent and the object (a) the duration of contact is superior or equal to take_duration , (b) the object is displaced during the contact, and (c) the agent and object remain in contact. For the optional second contact between the agent and the source (a) the duration of the contact is inferior to take_duration , and (b) the agent and source do not remain in contact. Finally, contact between the object and source is broken during the event.</Paragraph>
      <Paragraph position="4"> Give(agent, object, recipient): In this multiple contact event, the agent first takes the object, and then gives the object to the recipient. For the first contact between the agent and the object (a) the duration of contact is inferior to take_duration , (b) the object is displaced during the contact, and (c) the agent and object do not remain in contact. For the second contact between the object and the recipient (a) the duration of the contact is superior to take_duration , and (b) the object and recipient remain in contact. For the third (optional) contact between the agent and the recipient (a) the duration of the contact is inferior to take_duration and thus the elements do not remain in contact.</Paragraph>
      <Paragraph position="5"> These event labeling templates form the basis for a template matching algorithm that labels events based on the contact list, similar to the spanning interval and event logic of Siskind (2001).</Paragraph>
      <Paragraph position="6">  The events described above are simple in the sense that there have no hierarchical structure. This imposes serious limitations on the syntactic complexity of the corresponding sentences (Feldman et al. 1996, Miikkulainen 1996). The sentence &amp;quot;The block that pushed the moon was touched by the triangle&amp;quot; illustrates a complex event that exemplifies this issue. The corresponding compound event will be recognized and represented as a pair of temporally successive simple event descriptions, in this case: push(block, moon) , and touch(triangle, block) . The &amp;quot;block&amp;quot; serves as the link that connects these two simple events in order to form a complex hierarchical event.</Paragraph>
      <Paragraph position="7"> 3. Structure mapping for language learning The mapping of sentence form onto meaning (Goldberg 1995) takes place at two distinct levels: Words are associated with individual components of event descriptions, and grammatical structure is associated with functional roles within scene events. The first level has been addressed by Siskind (1996), Roy &amp; Pentland (2000) and Steels (2001) and we treat it here in a relatively simple but effective manner. Our principle interest lies more in the second level of mapping between scene and sentence structure.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Word Meaning
</SectionTitle>
      <Paragraph position="0"> In the initial learning phases there is no influence of syntactic knowledge and the word-referent associations are stored in the WordToReferent matrix (Eqn 1) by associating every word with every referent in the current scene ( a = 0), exploiting the cross-situational regularity (Siskind 1996) that a given word will have a higher coincidence with referent to which it refers than with other referents. This initial word learning contributes to learning the mapping between sentence and scene structure (Eqn. 4, 5 &amp; 6 below). Then, knowledge of the syntactic structure, encoded in FormToMeaning can be used to identify the appropriate referent (in the SEA) for a given word (in the OCA), corresponding to a non-zero value of a in Eqn. 1. In this &amp;quot;syntactic bootstrapping&amp;quot; for the new word &amp;quot;gugle,&amp;quot; for example, syntactic knowledge of Agent-Event-Object structure of the sentence &amp;quot;John pushed the gugle&amp;quot; can be used to assign &amp;quot;gugle&amp;quot; to the object of push.</Paragraph>
      <Paragraph position="2"> words in OCA are translated to Predicted Referents in the PRA via the WorldToReferent mapping. PRA elements are mapped onto their roles in the SEA by the FormToMeaning mapping, specific to each sentence type.</Paragraph>
      <Paragraph position="3"> This mapping is retrieved from Construction Inventory, via the ConstructionIndex that encodes the closed class words that characterize each sentence type.</Paragraph>
      <Paragraph position="4"> Open vs Closed Class Word Categories Our approach is based on the cross-linguistic observation that open class words (e.g. nouns, verbs, adjectives and adverbs) are assigned to their thematic roles based on word order and/or grammatical function words or morphemes (Bates et al. 1982). Newborn infants are sensitive to the perceptual properties that distinguish these two categories (Shi et al. 1999), and in adults, these categories are processed by dissociable neurophysiological systems (Brown et al. 1999).</Paragraph>
      <Paragraph position="5">  Similarly, artificial neural networks can also learn to make this function/content distinction (Morgan et al. 1996). Thus, for the speech input that is provided to the learning model open and closed class words are directed to separate processing streams that preserve their order and identity, as indicated in Figure 1.</Paragraph>
      <Paragraph position="6"> Note that by making this dissociation between open and closed class elements, the grammar learning problem is substantially simplified. Again, it is thus of interest that newborn infants can perform this lexical categorization (Shi et al. 1999), and we have recently demonstrated that a recurrent network of leaky integrator neurons can categorize open and closed class words based on the structure of the F0 component of the speech signal in French and English ( Blanc, Dodane &amp; Dominey 2003).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Mapping Sentence to Meaning
</SectionTitle>
      <Paragraph position="0"> In terms of the architecture in Figure 2, this mapping can be characterized in the following successive steps.</Paragraph>
      <Paragraph position="1"> First, words in the Open Class Array are decoded into their corresponding scene referents (via the WordToReferent mapping) to yield the Predicted Referents Array that contains the translated words while preserving their original order from the OCA (Eqn 2).</Paragraph>
      <Paragraph position="3"> Next, each sentence type will correspond to a specific form to meaning mapping between the PRA and the SEA.</Paragraph>
      <Paragraph position="4"> encoded in the FormToMeaning array. The problem will be to retrieve for each sentence type, the appropriate corresponding FormToMeaning mapping. To solve this problem, we recall that each sentence type will have a unique constellation of closed class words and/or bound morphemes (Bates et al. 1982) that can be coded in a ConstructionIndex (Eqn.3) that forms a unique identifier for each sentence type. Thus, the appropriate FormToMeaning mapping for each sentence type can be indexed in ConstructionInventory by its corresponding</Paragraph>
      <Paragraph position="6"> The link between the ConstructionIndex and the corresponding FormToMeaning mapping is established as follows. As each new sentence is processed, we first reconstruct the specific FormToMeaning mapping for that sentence (Eqn 4), by mapping words to referents (in PRA) and referents to scene elements (in SEA). The resulting, FormToMeaningCurrent encodes the correspondence between word order (that is preserved in the PRA Eqn 2) and thematic roles in the SEA. Note that the quality of FormToMeaningCurrent will depend on the quality of acquired word meanings in WordToReferent. Thus, syntactic learning requires a minimum baseline of semantic knowledge.</Paragraph>
      <Paragraph position="8"> Given the FormToMeaningCurrent mapping for the current sentence, we can now associate it in the ConstructionInventory with the corresponding function word configuration or ConstructionIndex for that sentence, expressed in (Eqn 5).</Paragraph>
      <Paragraph position="10"> Finally, once this learning has occurred, for new sentences we can now extract the FormToMeaning mapping from the learned ConstructionInventory by using the ConstructionIndex as an index into this associative memory, illustrated in Eqn. 6.</Paragraph>
      <Paragraph position="12"> To accommodate the dual scenes for complex events Eqns. 4-7 are instantiated twice each, to represent the two components of the dual scene. In the case of simple scenes, the second component of the dual scene representation is null.</Paragraph>
      <Paragraph position="13"> We evaluate performance by using the WordToReferent and FormToMeaning knowledge to construct for a given input sentence the &amp;quot;predicted scene&amp;quot;. That is, the model will construct an internal representation of the scene that should correspond to the input sentence. This is achieved by first converting the Open-Class-Array into its corresponding scene items in the Predicted-Referents-Array as specified in Eqn. 2. The referents are then re-ordered into the proper scene representation via application of the FormToMeaning transformation as described in Eqn. 7.</Paragraph>
      <Paragraph position="15"> When learning has proceeded correctly, the predicted scene array (PSA) contents should match those of the scene event array (SEA) that is directly derived from input to the model. We then quantify performance error in terms of the number of mismatches between PSA and SEA.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Experimental results
</SectionTitle>
    <Paragraph position="0"> Hirsh-Pasek &amp; Golinkof (1996) indicate that children can use knowledge of word meaning to acquire a fixed SVO template around 18 months, and then expand this to non-canonical sentence forms around 24+ months.</Paragraph>
    <Paragraph position="1"> Tomasello (1999) similarly indicates that fixed grammatical constructions will be used initially, and that these will then provide the basis for the development of more generalized constructions (Goldberg 1995). The following experiments attempt to follow this type of developmental progression.</Paragraph>
    <Paragraph position="2">  A. Learning of Active Forms for Simple Events 1. Active: The block pushed the triangle.</Paragraph>
    <Paragraph position="3"> 2. Dative: The block gave the triangle to the moon.  For this experiment, 17 scene/sentence pairs were generated that employed the 5 different events, and narrations in the active voice, corresponding to the grammatical forms 1 and 2. The model was trained for 32 passes through the 17 scene/sentence pairs for a total of 544 scene/sentence pairs. During the first 200 scene/sentence pair trials, a in Eqn. 1 was 0 (i.e. no syntactic bootstrapping before syntax is adquired), and thereafter it was 1. This was necessary in order to avoid the random effect of syntactic knowledge on semantic learning in the initial learning stages. The trained system displayed error free performance for all 17 sentences, and generalization to new sentences that had not previously been tested.</Paragraph>
    <Paragraph position="4"> B. Passive forms This experiment examined learning active and passive grammatical forms, employing grammatical forms 1-4. Word meanings were used from Experiment A, so only the structural FormToMeaning mappings were learned.</Paragraph>
    <Paragraph position="5">  3. Passive: The triangle was pushed by the block. 4. Dative Passive: The moon was given to the triangle  by the block.</Paragraph>
    <Paragraph position="6"> Seventeen new scene/sentence pairs were generated with active and passive grammatical forms for the narration. Within 3 training passes through the 17 sentences (51 scene/sentence pairs), error free performance was achieved, with confirmation of error free generalization to new untrained sentences of these types. The rapid learning indicates the importance of lexicon in establishing the form to meaning mapping for the grammatical constructions.</Paragraph>
    <Paragraph position="7"> C. Relative forms for Complex Events Here we consider complex scenes narrated by sentences with relative clauses. Eleven complex scene/sentence pairs were generated with narration corresponding to the grammatical forms indicated in 5 - 10:  moon.</Paragraph>
    <Paragraph position="8"> After presentation of 88 scene/sentence pairs, the model performed without error for these 6 grammatical forms, and displayed error-free generalization to new sentences that had not been used during the training for all six grammatical forms.</Paragraph>
    <Paragraph position="9"> D. Combined Test with and Without Lexicon A total of 27 scene/sentence pairs, used in Experiments B and C, were employed that exercised the ensemble of grammatical forms 1 - 10 using the learned WordToReferent mappings. After exposure to 162 scene/sentence pairs the model performed and generalized without error. When this combined test was performed without the pre-learned lexical mappings in WordToReferent, the system failed to converge, illustrating the advantage of following the developmental progression from lexicon to simple to complex grammatical structure. This also illustrates the importance of interaction between syntactic and semantic knowledge that is treated in more detail in Dominey (2000).</Paragraph>
    <Paragraph position="10"> E. Some Scaling Issues A small lexicon and construction inventory are used to illustrate the system behavior. Based on the independant representation formats, the architecture should scale well. The has now been tested with a larger lexicon, and has learned over 35 grammatical constructions. The system should extend to all languages in which sentence to meaning mapping is encoded by word order and/or grammatical marking (Bates et al. 1982). In the current study, deliberate human event production yielded essentially perfect recognition, though the learning model is relatively robust (Dominey 2000) to elevated scene error rates.</Paragraph>
    <Paragraph position="11"> F. Representing Hierarchical Structure The knowledge of the system is expressed in the WorldToReferent and FormToMeaning matrices. In order to deal with complex sentences with embedded clauses, it is necessary to use this same knowledge at different levels of the hierarchy. For this, a &amp;quot;branching mechanism&amp;quot; is necessary, that ordinates the input and output vectors corresponding to meaning and word events. An effective solution to that problem is to learn the branching for each construction as we have done. However, a real account of the human faculty of recursion should be both general (i.e. it should apply to any reasonably complex structure) and plausible (i.e. the branching mechanism should be connectionist). In order to provide this level of generality, neural models need to include a logical &amp;quot;stack&amp;quot; ( cf Miikkulainen 1996), in order to process the context of embedded sentences. Complex structures themselves may be represented in a connectionist way, using the Recursive Auto-Associative Memory (Pollack, 1990). In (Voegtlin and Dominey 2003), we proposed a representation system for complex events, that is both generative (it can handle any structure) and systematic (it can generalize, and it does so in a compositional way). This system could be used here, as its representation readily provides a case-role system. The advantages are twofold. First, the branching mechanism is implemented in a neurally realistic way. Second, the recursion capability of the system will allow it to apply its knowledge to any sentence form, whether known or new. Future research will address this issue.</Paragraph>
    <Paragraph position="12"> Conclusion The current study demonstrates (1) that the perceptual primitive of contact (available to infants at 5 months), can be used to perform event description in a manner that is similar to but significantly simpler than Siskind (2001), (2) that a novel implementation of principles from construction grammar can be used to map sentence form to these meanings together in an integrated system, (3) that relative clauses can be processed in a manner that is similar to, but requires less specific machinery (e.g. no stack) than that in Miikkalanian (1996), and finally (4) that the resulting system displays robust acquisition behavior that reproduces certain observations from developmental studies with very modest &amp;quot;innate&amp;quot; language specificity.</Paragraph>
    <Paragraph position="13"> Note that one could have taken the same approach by integrating Siskind's (2001) full event system, and Miikkulainen's (1996) embedded case-role system. Each of these however required significant architectural complexity to accomplish the full job. The current goal was to identify minimal event recognition and form-tomeaning mapping capabilities that could be integrated into a coherent system that performs at the level of a human infant in the first years of development when the construction inventory is being built up. This forms the basis for the infant's subsequent ability to de- and recompose these constructions in a truly compositional manner, a topic of future research.</Paragraph>
  </Section>
class="xml-element"></Paper>