File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/w04-3206_evalu.xml
Size: 6,386 bytes
Last Modified: 2025-10-06 13:59:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3206"> <Title>Scaling Web-based Acquisition of Entailment Relations</Title> <Section position="10" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We ran the TE/ASE algorithm on a random lexicon of verbal forms and assessed its performance on the extracted data through human judgments.</Paragraph> <Paragraph position="2"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Setting </SectionTitle> <Paragraph position="0"> The test set for human evaluation was generated by randomly selecting 53 verbs from the 1000 most frequent ones found in a subset of the Reuters corpus. For each verb entry in the lexicon, we provided the judges with the corresponding pivot template and the list of related candidate entailment templates found by the system. The judges were asked to evaluate entailment for a total of 752 templates, extracted for the 53 pivot lexicon entries. Table 1 shows a sample of the evaluated templates; all of them are clearly good and were judged correct.</Paragraph> <Paragraph position="1"> included in the evaluation test set.</Paragraph> <Paragraph position="2"> For the ASE algorithm, the threshold parameters were set to PHRASEMAXF=10^7, SETMINF=10^2, SETMAXF=10^5, SETMINP=0.066, and SETMAXP=0.666. An upper limit of 30 was imposed on the number of possible anchor sets used for each pivot. 
Since this limit turned out to be very conservative with respect to system coverage, we subsequently attempted to relax it to 50 (see Discussion in Section 4.3).</Paragraph> <Paragraph position="3"> con before the actual experiment.</Paragraph> <Paragraph position="4"> Further post-processing of the extracted data was necessary to remove syntactic variations of the same candidate template (typically passive/active variations).</Paragraph> <Paragraph position="5"> Three judgment categories were considered: Correct, if an entailment relationship holds in at least one direction between the judged template and the pivot template in some non-bizarre context; Incorrect, if there is no reasonable context and variable instantiation in which entailment holds; and No Evaluation, if the judge cannot come to a definite conclusion.</Paragraph> </Section> </Section> <Section position="11" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> Each of the three assessors (referred to as J#1, J#2, and J#3) issued judgments for the 752 different templates. The three judges marked 283, 313, and 295 templates as Correct, respectively, and 2, 0, and 16 as No Evaluation; the remaining templates were judged Incorrect.</Paragraph> <Paragraph position="1"> For each verb, we calculate Yield as the absolute number of Correct templates found and Precision as the percentage of Correct templates out of all extracted templates. The resulting Precision is 44.15%, averaged over the 53 verbs and the 3 judges. Considering the Low Majority of the judges, Precision is 42.39%.</Paragraph> <Paragraph position="2"> Average Yield was 5.5 templates per verb.</Paragraph> <Paragraph position="3"> These figures may be compared (informally, since the data sets are not directly comparable) with the average Yield of 10.1 and average Precision of 50.3% reported for the 9 &quot;pivot&quot; templates of (Lin and Pantel, 2001). 
The comparison suggests that it is possible to obtain from the (very noisy) web a precision in a range similar to that obtained from a clean news corpus. It also indicates that there is potential for acquiring additional templates per pivot, which would require further research on efficiently broadening the search for additional web data per pivot.</Paragraph> <Paragraph position="4"> Agreement among judges is measured by the Kappa value, which is 0.55 between J#1 and J#2, 0.57 between J#2 and J#3, and 0.63 between J#1 and J#3. These Kappa values correspond to moderate agreement for the first two pairs and substantial agreement for the third. Overall, all three judges agreed unanimously on 519 of the 752 templates, which corresponds to 69%.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> Our algorithm obtained encouraging results, extracting a considerable number of interesting templates and showing an inherent capability for discovering complex semantic relations.</Paragraph> <Paragraph position="1"> Concerning overall coverage, we managed to find correct templates for 86% of the verbs (46 out of 53). Nonetheless, the results show a substantial margin for improvement. In fact, Yield values (5.5 Low Majority, up to 24 in the best cases), which are our first concern, depend inherently on the breadth of the Web search performed by the ASE algorithm. Because of computation time, the maximal number of anchor sets processed for each verb was limited to 30, which significantly reduced the amount of retrieved data.</Paragraph> <Paragraph position="2"> To further investigate the potential of ASE, we subsequently performed extended experimental trials in which the number of anchor sets per pivot was raised to 50. For these trials we randomly chose a subset of 10 verbs from the less frequent ones in the original main experiment. 
In the main experiment, these verbs had an average Yield of 3 and an average Precision of 45.19%. In contrast, the extended experiments on the same verbs achieved an average Yield of 6.5 and an average Precision of 59.95%. These results are promising, and the substantial growth in Yield indicates that the TE/ASE algorithms can be further improved. We thus suggest that these results demonstrate the feasibility of our approach and the inherent scalability of the TE/ASE process, as well as its potential to acquire a large entailment relation KB using a full-scale lexicon.</Paragraph> <Paragraph position="3"> A further direction for improvement relates to template ranking and filtering. While in this paper we considered anchor sets to have equal weights, we are also carrying out experiments with weights based on cross-correlation between anchor sets.</Paragraph> </Section> </Section> </Paper>