<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0606"> <Title>Learning Word Meaning and Grammatical Constructions from Narrated Video Events</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Feldman et al. (1990) posed the problem of &quot;miniature&quot; language acquisition based on <sentence, image> pairs as a &quot;touchstone&quot; for cognitive science. In this task, an artificial system is confronted with a reduced version of the problem of language acquisition faced by the child, that involves both the extraction of meaning from the image, and the mapping of the paired sentence onto this meaning.</Paragraph> <Paragraph position="1"> In this developmental context, Mandler (1999) suggested that the infant begins to construct meaning from the scene based on the extraction of perceptual primitives. From simple representations such as contact, support, attachment (Talmy 1988) the infant could construct progressively more elaborate representations of visuospatial meaning. Thus, the physical event &quot;collision&quot; is a form of the perceptual primitive &quot;contact&quot;. Kotovsky & Baillargeon (1998) observed that at 6 months, infants demonstrate sensitivity to the parameters of objects involved in a collision, and the resulting effect on the collision, suggesting indeed that infants can represent contact as an event predicate with agent and patient arguments.</Paragraph> <Paragraph position="2"> Siskind (2001) has demonstrated that force dynamic primitives of contact, support, attachment can be extracted from video event sequences and used to recognize events including pick-up, put-down, and stack based on their characterization in an event logic. The use of these intermediate representations renders the system robust to variability in motion and view parameters. Most importantly, Siskind demonstrated that the lexical semantics for a number of verbs could be established by automatic image processing.</Paragraph> <Paragraph position="3"> Sentence to meaning mapping: Once meaning is extracted from the scene, the significant problem of mapping sentences to meanings remains. The nativist perspective on this problem holds that the <sentence, meaning> data to which the child is exposed is highly indeterminate, and underspecifies the mapping to be learned. This &quot;poverty of the stimulus&quot; is a central argument for the existence of a genetically specified universal grammar, such that language acquisition consists of configuring the UG for the appropriate target language ( Chomsky 1995 ). In this framework, once a given parameter is set, its use should apply to new constructions in a generalized, generative manner.</Paragraph> <Paragraph position="4"> An alternative functionalist perspective holds that learning plays a much more central role in language acquisition. The infant develops an inventory of grammatical constructions as mappings from form to meaning (Goldberg 1995). These constructions are initially rather fixed and specific, and later become generalized into a more abstract compositional form employed by the adult (Tomasello 1999). In this context, construction of the relation between perceptual and cognitive representations and grammatical form plays a central role in learning language (e.g. Feldman et al. 
1990, 1996; Langacker 1991; Mandler 1999; Talmy 1998).</Paragraph> <Paragraph position="5"> These issues of learnability and innateness have provided a rich motivation for simulation studies that have taken a number of different forms. Elman (1990) demonstrated that recurrent networks are sensitive to predictable structure in grammatical sequences.</Paragraph> <Paragraph position="6"> Subsequent studies of grammar induction demonstrated how syntactic structure can be recovered from sentences (e.g. Stolcke & Omohundro 1994). From the &quot;grounding of language in meaning&quot; perspective (e.g. Feldman et al. 1990, 1996; Langacker 1991; Goldberg 1995), Chang & Maia (2001) exploited the relations between action representation and simple verb frames in a construction grammar approach. In an effort to consider more complex grammatical forms, Miikkulainen (1996) demonstrated a system that learned the mapping between relative phrase constructions and multiple event representations, based on the use of a stack for maintaining state information during the processing of the next embedded clause in a recursive manner.</Paragraph> <Paragraph position="7"> In a more generalized approach, Dominey (2000) exploited the regularity that sentence-to-meaning mapping is encoded in all languages by word order and grammatical marking (bound or free) (Bates et al. 1982). That model was based on the functional neurophysiology of cognitive sequence and language processing and an associated neural network model that has been demonstrated to simulate aspects of both infant (Dominey & Ramus 2000) and adult (Dominey et al. 2003) language processing.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Objectives </SectionTitle> <Paragraph position="0"> The goals of the current study are fourfold: first, to test the hypothesis that meaning can be extracted from visual scenes based on the detection of contact and its parameters, in an approach similar to, but significantly simplified from, Siskind (2001); second, to determine whether the model of Dominey (2000) can be extended to handle embedded relative clauses; third, to demonstrate that these two systems can be combined to perform miniature language acquisition; and finally, to demonstrate that the combined system can provide insight into the developmental progression in human language acquisition without the necessity of a pre-wired parameterized grammar system (Chomsky 1995).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> The Training Data </SectionTitle> <Paragraph position="0"> The human experimenter enacts and simultaneously narrates visual scenes made up of events that occur between a red cylinder, a green block, and a blue semicircle or &quot;moon&quot; on a black matte table surface. A video camera above the surface provides a video image that is processed by a color-based recognition and tracking system (Smart, Panlab, Barcelona, Spain), which generates a time-ordered sequence of the contacts between objects; this sequence is subsequently processed for event analysis (below). The simultaneous narration of the ongoing events is processed by a commercial speech-to-text system (IBM ViaVoice™). Speech and vision data were acquired and then processed off-line, yielding a data set of matched sentence-scene pairs that were provided as input to the structure mapping model. 
A total of ~300 <sentence, scene> pairs were tested in the following experiments.</Paragraph> </Section> </Section> </Paper>
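The training-data pipeline above pairs a narrated sentence with an event predicate derived from a time-ordered sequence of object contacts. The following is a minimal Python sketch of one way such <sentence, scene> pairs and the contact-to-event step could be represented; the names (Contact, detect_event, make_pair) and the event rules are hypothetical simplifications for illustration, not the paper's actual event-analysis procedure.

from dataclasses import dataclass
from typing import List, Optional, Tuple

# The three objects enacted on the table surface in the training scenes.
OBJECTS = {"cylinder", "block", "moon"}

@dataclass
class Contact:
    t: float          # time of contact onset (seconds)
    agent: str        # object whose motion initiated the contact
    patient: str      # object that was contacted
    duration: float   # how long the contact lasted (seconds)

def detect_event(contacts: List[Contact]) -> Optional[Tuple[str, str, str]]:
    """Map a contact sequence onto an event predicate (predicate, agent, patient).

    Illustrative rule only: a brief contact is treated as touch(agent, patient),
    a longer contact as push(agent, patient).
    """
    if not contacts:
        return None
    first = contacts[0]
    if first.duration < 0.5:
        return ("touch", first.agent, first.patient)
    return ("push", first.agent, first.patient)

def make_pair(sentence: str, contacts: List[Contact]) -> dict:
    """Bundle a narrated sentence with the meaning extracted from the scene."""
    return {"sentence": sentence, "meaning": detect_event(contacts)}

# Example: the narration "The cylinder touched the moon" paired with one brief contact.
pair = make_pair("The cylinder touched the moon",
                 [Contact(t=1.2, agent="cylinder", patient="moon", duration=0.3)])
print(pair)  # {'sentence': ..., 'meaning': ('touch', 'cylinder', 'moon')}

A corpus of such pairs, accumulated off-line, would then serve as input to the structure mapping model described in the section above.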