<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0603">
  <Title>Understanding Complex Visually Referring Utterances</Title>
  <Section position="9" start_page="0" end_page="0" type="ackno">
    <SectionTitle>8 Summary</SectionTitle>
    <Paragraph position="0">We have presented a model of visually grounded language understanding. At the heart of the model is a set of lexical items, each grounded in terms of visual features and grouping properties when applied to objects in a scene. A robust parsing algorithm finds chunks of syntactically coherent words in an input utterance. To determine the semantics of phrases, the parser activates semantic composers that combine words to determine their joint reference. The robust parser is able to process grammatically ill-formed transcripts of natural spoken utterances. In evaluations, the system selected correct objects in response to utterances for 76.5% of the development set and 58.7% of the test set. On clean data sets, with various speech and processing errors held out, performance was higher still. We suggested several avenues for improving the system's performance, including better methods for spatial grouping, semantically guided backtracking during sentence processing, the use of machine learning to replace hand construction of models, and the use of interactive dialogue to resolve ambiguities.</Paragraph>
    <Paragraph position="1">In the near future, we plan to transplant Bishop into an interactive conversational robot (Roy et al., forthcoming 2003), vastly improving the robot's ability to comprehend spatial language in situated spoken dialogue.</Paragraph>
  </Section>
</Paper>
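The summary above outlines a pipeline: grounded lexical items, robust chunk parsing over possibly ill-formed input, and semantic composers that intersect word meanings to find a phrase's joint reference. As a rough illustration only, here is a minimal Python sketch of that chunk-then-compose idea. All names (Obj, LEXICON, chunk, compose) and the toy grounding predicates are hypothetical stand-ins, not drawn from the actual Bishop system.

    # Hypothetical sketch of grounded chunk parsing and semantic composition.
    # Names and predicates are illustrative, not from the paper's system.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Obj:
        """An object in the scene, described by simple visual features."""
        color: str
        x: float
        y: float

    # Each lexical item is grounded as a predicate over scene objects:
    # applying a word to the scene yields the set of objects it could denote.
    LEXICON = {
        "green": lambda o, scene: o.color == "green",
        "red":   lambda o, scene: o.color == "red",
        "left":  lambda o, scene: o.x <= min(p.x for p in scene),
        "cone":  lambda o, scene: True,  # head noun: no visual restriction here
    }

    def chunk(utterance):
        """Toy stand-in for robust parsing: keep runs of known words,
        skipping disfluencies and unknown tokens in ill-formed input."""
        words, chunks, cur = utterance.lower().split(), [], []
        for w in words:
            if w in LEXICON:
                cur.append(w)
            elif cur:
                chunks.append(cur)
                cur = []
        if cur:
            chunks.append(cur)
        return chunks

    def compose(chunk_words, scene):
        """Semantic composer: intersect each word's grounded denotation
        to obtain the joint reference of the phrase."""
        candidates = set(scene)
        for w in chunk_words:
            candidates = {o for o in candidates if LEXICON[w](o, scene)}
        return candidates

    scene = [Obj("green", 0.1, 0.5), Obj("green", 0.9, 0.2), Obj("red", 0.4, 0.7)]
    for c in chunk("um the green cone on the left"):
        print(c, "->", compose(c, scene))

On the disfluent input "um the green cone on the left", the toy chunker yields two coherent chunks, ["green", "cone"] and ["left"], and composition narrows each to its candidate referents by set intersection; the real system grounds far richer features and grouping properties, but the compose-by-restriction structure is the same idea the summary describes.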