File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-0101_intro.xml
Size: 3,639 bytes
Last Modified: 2025-10-06 14:00:53
<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0101">
<Title>Sentences vs. Phrases: Syntactic Complexity in Multimedia Information Retrieval</Title>
<Section position="3" start_page="2" end_page="2" type="intro">
<SectionTitle> 3 Experiments </SectionTitle>
<Paragraph position="0"> While the sentence captions are syntactically more complex, they contain, by almost any measure, more information than the legacy word list captions. Specifically, the part-of-speech tagger and the noun phrase pattern matcher are essentially useless with the word lists, since those tools rely on syntactic patterns that the word lists do not contain. We therefore hypothesized that our retrieval accuracy would be lower with the legacy word list captions than with the sentence captions.</Paragraph>
<Paragraph position="1"> We performed two sets of experiments, one with legacy word list captions and the other with sentence captions. Fortunately, the corpus can be divided easily, since it is possible to select image providers with either full sentence or word list captions and to limit the search to those providers. To ensure that we did not introduce a bias because of the quality of captioning for a particular provider, we aggregated scores from at least three providers in each test.</Paragraph>
<Paragraph position="2"> Because the collection is large and live, and includes ranked results, we selected a modified version of precision at 20 rather than a manual gold-standard precision/recall test. We chose this evaluation path for the following reasons. We initially performed experiments with manual ranking and found that it was impossible to get reliable cross-coder judgements for ranked results: we could get humans to assess whether an image should or should not have been included, but the rankings did not yield agreement. Complicating the problem, we had a large collection (400,000+ images), and creating a test subset meant that most queries would generate almost no relevant results. Finally, we wanted to focus more on precision than on recall, because our work with users had made it clear that precision was far more important in this application.</Paragraph>
<Paragraph position="3"> To evaluate precision at 20 for this collection, we used the crossing measure introduced in Flank (1998). The crossing measure (in which any image ranked above another, better-matching image counts as an error) is both finer-grained and better suited to a ranking application in which user evaluations are not binary. We calibrated the crossing measure (on a subset of the queries) against two simpler precision measures: we calculated precision &quot;for all terms&quot; as a binary measure with respect to a query, scoring an error if any term in the query was not matched; for the &quot;any term&quot; precision measure, we scored an error only if the image failed to match any term in the query in such a way that a user would consider it a partial match.</Paragraph>
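To make these evaluation measures concrete, the following is a minimal, illustrative Python sketch, not the system described in the paper: the synonym table, the caption term sets, and the numeric relevance grades fed to the crossing count are hypothetical stand-ins, and the cutoff falls back to the number of images retrieved when fewer than 20 come back (the R-precision variant noted at the end of this section).

    # Illustrative sketch (not the authors' implementation) of the measures
    # described above: binary "all terms" / "any term" precision per image,
    # the pairwise crossing count over a ranked result list, and precision
    # at a cutoff of 20 (or at the number retrieved, if smaller).
    from itertools import combinations

    def expand(term, synonyms):
        # A query term matches itself or any of its (hypothetical) synonyms.
        return {term} | set(synonyms.get(term, ()))

    def all_terms_match(query_terms, caption_terms, synonyms):
        # "All terms": an error unless every query term (or a synonym of it)
        # appears among the caption terms.
        return all(expand(t, synonyms) & caption_terms for t in query_terms)

    def any_term_match(query_terms, caption_terms, synonyms):
        # "Any term": an error only if no query term (or synonym) appears.
        return any(expand(t, synonyms) & caption_terms for t in query_terms)

    def crossing_errors(ranked_grades):
        # Crossing measure: each pair in which an image is ranked above
        # another, better-matching image (higher grade) counts as one error.
        # `ranked_grades` is a list of relevance grades in ranked order.
        return sum(1 for upper, lower in combinations(ranked_grades, 2)
                   if upper < lower)

    def precision_at_cutoff(ranked_matches, cutoff=20):
        # Modified precision at 20: with fewer than `cutoff` images
        # retrieved, evaluate at the number actually retrieved.
        k = min(cutoff, len(ranked_matches))
        return sum(ranked_matches[:k]) / k if k else 0.0

    # Toy run with the example query discussed below (synonyms invented here).
    synonyms = {"glass": {"tumbler"}, "beer": {"ale"}}
    query = ["tall", "glass", "beer"]
    print(all_terms_match(query, {"tall", "tumbler", "ale"}, synonyms))  # True
    print(any_term_match(query, {"glasses"}, synonyms))                 # False
    print(crossing_errors([3, 1, 2]))   # 1: the grade-2 image sits below grade-1
    print(precision_at_cutoff([True, True, False, True]))               # 0.75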
<Paragraph position="4"> Thus, for example, for an &quot;all terms&quot; match, the query tall glass of beer succeeded only when the images showed (and the captions mentioned) all three terms tall, glass, and beer, or their synonyms. For an &quot;any term&quot; match, tall or glass or beer or a direct synonym would need to be present (but not, say, glasses).</Paragraph>
<Paragraph position="5"> (For two of the test queries, fewer than 20 images were retrieved, so the measure is, more precisely, R-precision: precision at the number of documents retrieved or at 20 or 5, whichever is less.)</Paragraph>
</Section>
</Paper>