<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0805"> <Title>Exploring Semantic Constraints for Document Retrieval</Title>
<Section position="7" start_page="38" end_page="39" type="evalu"> <SectionTitle> 5 Results and Discussion </SectionTitle>
<Paragraph position="0"> Our goal is to explore whether using semantic information would improve document retrieval, taking into account the errors introduced by semantic processing. We therefore evaluate two aspects of our system: the accuracy of AV extraction and the precision of document retrieval.</Paragraph>
<Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 5.1 Evaluating AV Extraction </SectionTitle>
<Paragraph position="0"> We tested the AV extraction system on a portion of the annotated documents, which contains 253 AV pairs. Of these pairs, 151 have string values and the rest have numerical values.</Paragraph>
<Paragraph position="1"> The results show a prediction accuracy of 50.6%, with 35.2% false negatives (missing AV pairs), 11% false positives, and 3% wrong predictions. Some attributes, such as brand and resolution, have higher extraction accuracy than others, such as shooting mode and dimension. An analysis of the missing pairs reveals three main sources of error: 1) an incomplete domain model, which misses camera Condition phrases such as &quot;minor surface scratching&quot;; 2) a noisy domain model, due to the automatic nature of its construction; and 3) parsing errors caused by free-form, human-written text. Given that the prediction accuracy is calculated over 40 attributes and that no human labor is involved in constructing the domain model, we consider our approach a satisfactory first step toward exploring the AV extraction problem.</Paragraph> </Section>
<Section position="2" start_page="38" end_page="39" type="sub_section"> <SectionTitle> 5.2 Evaluating AV-based Document Retrieval </SectionTitle>
<Paragraph position="0"> The three retrieval systems (S1, S2, and S3) each return the top 200 documents for evaluation. Figure 2 summarizes the precision they achieved against both the relaxed and strict judgments, measured by the standard TREC metrics: Precision at N (PN), the precision at an N-document cutoff point; Average Precision (AP), the average of the precision values obtained after each relevant document is retrieved, with Mean Average Precision (MAP) being the average of AP over all topics; and R-Precision (RP), the precision after R documents have been retrieved, where R is the number of relevant documents for the topic.</Paragraph>
<Paragraph position="1"> For both judgments, the combined system S3 achieved higher precision and recall than S1 and S2 on all metrics. In the case of recall, the absolute scores improve by at least nine percent. Table 2 shows a pairwise comparison of the systems on three of the most meaningful TREC metrics, using a paired T-test; statistically significant results are highlighted. The table shows that the improvement of S3 over S1 and S2 is significant (or very nearly so) on all metrics for the relaxed judgment. However, for the strict judgment, none of the improvements is significant. The reason might be that one third of the topics have no relevant documents in our data set, which reduces the actual number of topics available for evaluation.</Paragraph>
<Paragraph position="2"> In general, the performance of all three systems on the strict judgment is worse than on the relaxed judgment, likely because of the smaller number of relevant documents in this category (18 per topic on average), which makes it a harder IR task.</Paragraph>
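To make the TREC metrics used above concrete, the following is a minimal Python sketch (illustrative only, not the authors' evaluation code) that computes Precision at N, Average Precision, and R-Precision for a single topic from a ranked list of document IDs and a set of relevant IDs; MAP is simply the mean of AP over all topics.

# Minimal sketch of the TREC metrics defined above (illustrative, not the paper's code).

def precision_at_n(ranked_ids, relevant_ids, n):
    """Precision at an N-document cutoff point."""
    top_n = ranked_ids[:n]
    hits = sum(1 for doc_id in top_n if doc_id in relevant_ids)
    return hits / n

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values obtained after each relevant document
    is retrieved, divided by the total number of relevant documents (TREC AP);
    MAP is the mean of this value over all topics."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids)

def r_precision(ranked_ids, relevant_ids):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    return precision_at_n(ranked_ids, relevant_ids, r)

# Toy topic with 3 relevant documents.
ranked = ["d4", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d3"}
print(precision_at_n(ranked, relevant, 5))   # 0.4
print(average_precision(ranked, relevant))   # (1/3 + 2/5) / 3 ~= 0.244
print(r_precision(ranked, relevant))         # precision at rank 3 = 1/3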
<Paragraph position="3"> [Table 2: pairwise comparison between systems over all topics] The constraint-based system S2 produces higher initial precision than S1 as measured by P10. However, the semantic constraints contribute less and less as more documents are retrieved.</Paragraph>
<Paragraph position="4"> The performance of S2 is slightly worse than that of S1 as measured by AP and RP, which is likely due to errors from AV extraction. None of these differences is statistically significant.</Paragraph>
<Paragraph position="5"> Topic-by-topic analysis gives us a more detailed view of the behavior of the three systems. Figure 3 shows the performance of the systems measured by P10, sorted by that of S3. In general, the performance of S1 and S2 differs considerably on individual topics. However, the combined system, S3, seems able to boost the good results from both systems for most topics. We are currently exploring the factors that contribute to this performance boost.</Paragraph>
<Paragraph position="6"> A closer look at the topics where S3 improves significantly over S1 and S2 at P10 reveals that the combined lists are biased toward the documents returned by S2, probably because S2 assigns higher scores to documents than S1 does. This suggests the need for better score normalization methods that take the strengths of each system into account.</Paragraph>
<Paragraph position="7"> In conclusion, our results show that using semantic information can improve IR results in special domains where the information need can be specified as a set of semantic constraints. The constraint-based system by itself is not robust enough to serve as a standalone IR system and has to be combined with a term-based system to achieve satisfactory results. The results from the combined system appear to tolerate substantial errors in semantic annotation, given that the accuracy of AV extraction is only about 50%. It remains to be seen whether similar improvements in retrieval can be achieved in general domains such as news articles.</Paragraph> </Section> </Section> </Paper>
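A brief note on the score-normalization issue raised in Section 5.2: if a combined system merges ranked lists by summing raw retrieval scores, the system with the larger score range dominates the merged list. Below is a minimal sketch, assuming a simple per-system min-max normalization followed by a CombSUM-style merge; the function names and the normalization scheme are illustrative assumptions for this sketch, not the combination method described in the paper.

# Illustrative min-max normalization before merging two systems' scores.
# Names and the normalization scheme are assumptions, not identifiers from the paper.

def min_max_normalize(scores):
    """Map one system's raw scores into [0, 1] so that neither system
    dominates the merged list simply because its scores are larger."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine(term_scores, constraint_scores):
    """CombSUM over normalized scores: documents returned by both systems
    are rewarded; documents returned by only one keep their single score."""
    s1 = min_max_normalize(term_scores)        # term-based system (e.g., S1)
    s2 = min_max_normalize(constraint_scores)  # constraint-based system (e.g., S2)
    merged = {}
    for doc in set(s1) | set(s2):
        merged[doc] = s1.get(doc, 0.0) + s2.get(doc, 0.0)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: the constraint-based system's raw scores are much larger.
s1_raw = {"d1": 2.1, "d2": 1.8, "d3": 0.9}
s2_raw = {"d2": 40.0, "d4": 35.0}
print(combine(s1_raw, s2_raw))  # d2 ranks first; d4 no longer swamps d1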