File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/03/w03-0602_abstr.xml
Size: 884 bytes
Last Modified: 2025-10-06 13:43:07
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0602"> <Title>Words and Pictures in the News</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We discuss the properties of a collection of news photos and captions, collected from the Associated Press and Reuters. Captions have a vocabulary dominated by proper names. We have implemented various text clustering algorithms to organize these items by topic, as well as an iconic matcher that identifies articles that share a picture. We have found that the special structure of captions allows us to extract some names of people actually portrayed in the image quite reliably, using a simple syntactic analysis. We have been able to build a directory of face images of individuals from this collection. null</Paragraph> </Section> class="xml-element"></Paper>