File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1621_metho.xml
Size: 20,539 bytes
Last Modified: 2025-10-06 14:10:46
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1621"> <Title>Lexical Reference: a Semantic Matching Subtask</Title> <Section position="5" start_page="172" end_page="174" type="metho"> <SectionTitle> 3 The Lexical Reference Dataset </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="172" end_page="173" type="sub_section"> <SectionTitle> 3.1 Motivation and Definition </SectionTitle> <Paragraph position="0"> One of the major observations of the 1st Recognizing Textual Entailment (RTE-1) challenge referred to the rich structure of entailment modeling systems and the need to evaluate and optimize individual components within them. When building such a compound system it is valuable to test each component directly during its development, rather than indirectly evaluating the component's performance via the behavior of the entire system. Given tools to evaluate each component independently, researchers can target and improve the performance of the subcomponents without having to build and evaluate the entire end-to-end system.</Paragraph> <Paragraph position="1"> A common subtask, addressed by practically all participating systems in RTE-1, was to recognize whether each lexical meaning in the hypothesis is referenced by some meaning in the corresponding text. We suggest that this common goal can be captured through the following definition: Definition 1 A word w is lexically referenced by a text t if there is an explicit or implied reference from a set of words in t to a possible meaning of w.</Paragraph> <Paragraph position="2"> Lexical reference may be viewed as a natural extension of textual entailment for sub-sentential hypotheses such as words. In this work we focus on word meanings; however, the approach can be directly generalized to word compounds and phrases. A concrete version of detailed annotation guidelines for lexical reference is presented in the next section.1 Lexical Reference is, in some sense, a more general notion than paraphrases. If the text includes a paraphrase for w then naturally it does refer to w's meaning. However, a text need not include a paraphrase for the concrete meaning of the referenced word w, but only an implied reference. Accordingly, the referring part might be a large segment of the text, which captures information different from w's meaning but still implies a reference to w as part of the text's meaning.</Paragraph> <Paragraph position="3"> It is typically a necessary, but not sufficient, condition for textual entailment that the lexical concepts in a hypothesis h are referenced in a given text t. For example, in order to infer from a text the hypothesis &quot;a dog bit a man,&quot; it is necessary that the concepts of dog, bite and man be referenced by the text, either directly or in an implied manner. However, for proper entailment it is further required that the right relations hold between these concepts2. Therefore lexical entailment should typically be a component within a more complex entailment modeling (or semantic matching) system.</Paragraph> </Section> <Section position="2" start_page="173" end_page="174" type="sub_section"> <SectionTitle> 3.2 Dataset Creation and Annotation Process </SectionTitle> <Paragraph position="0"> We created a lexical reference dataset derived from the RTE-1 development set by randomly choosing 400 out of the 567 text-hypothesis examples. We then created sentence-word examples for all content words in the hypotheses which do not appear in the corresponding sentence and are not a morphological derivation of a word in it (since a simple morphological module could easily identify these cases). This resulted in a total of 708 lexical reference examples. Two annotators annotated these examples as described in the next section.</Paragraph>
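To make the example-generation step concrete, the following is a minimal sketch, not the authors' code: it assumes Python with NLTK, uses a standard English stopword list as a stand-in for content-word selection, and uses WordNet lemmatization as a rough substitute for the simple morphological module mentioned above.

```python
# Illustrative sketch of the sentence-word example generation described above.
# Assumptions (not from the paper): NLTK tokenization, stopword filtering as a
# proxy for content-word selection, and WordNet lemmatization as a proxy for
# the "simple morphological module".
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()
_stopwords = set(stopwords.words("english"))

def base_forms(word):
    # Rough set of morphological base forms (noun and verb readings).
    w = word.lower()
    return {w, _lemmatizer.lemmatize(w, "n"), _lemmatizer.lemmatize(w, "v")}

def make_examples(text, hypothesis):
    # Yield (text, target word) pairs for hypothesis content words that neither
    # appear in the text nor are a morphological variant of a text word.
    text_forms = set()
    for tok in word_tokenize(text):
        text_forms |= base_forms(tok)
    for tok in word_tokenize(hypothesis):
        if not tok.isalpha() or tok.lower() in _stopwords:
            continue  # skip punctuation and non-content words
        if base_forms(tok) & text_forms:
            continue  # covered by an identical or morphologically related text word
        yield text, tok
```

Each pair produced this way corresponds to one sentence-word example of the kind annotated in the dataset.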
<Paragraph position="1"> 1These terms should not be confused with the use of lexical entailment in WordNet, which describes an entailment relationship between verb lexical types, nor with the related notion of reference in classical linguistics, which generally describes the relation between nouns or pronouns and the objects named by them (Frege, 1892). 2Or, quoting the well-known journalism saying: &quot;Dog bites man&quot; isn't news, but &quot;Man bites dog&quot; is.</Paragraph> <Paragraph position="2"> Taking the same approach as the RTE-1 dataset creation (Dagan et al., 2006), we limited our experiments to the resulting 580 examples that the two annotators agreed upon3.</Paragraph> <Paragraph position="3"> We asked two annotators to annotate the sentence-word examples according to the following guidelines. Given a sentence and a target word, the annotators were asked to decide whether the target word is referenced by the sentence (true) or not (false). Annotators were guided to mark the pair as true in the following cases: Word: if there is a word in the sentence which, in the context of the sentence, implies a meaning of the target word (e.g. a synonym or hyponym), or which implies a reference to the target word's meaning (e.g. blind-see, sight). See examples 1-2 in Table 1, where the word that implies the reference is emphasized in the text. Note that in example 2 murder is not a synonym of died, nor does it share the same meaning as died; however, it is clear from its presence in the sentence that it refers to a death. Also note that in example 8, although home is a possible synonym for house, in the context of the text it does not appear in that meaning and the example should be annotated as false.</Paragraph> <Paragraph position="4"> Phrase: if there is a multi-word independent expression in the sentence that implies the target (implication in the same sense as for Word). See examples 3-4 in Table 1.</Paragraph> <Paragraph position="5"> Context: if there is a clear reference to the meaning of the target word by the overall meaning of some part(s) of the sentence (possibly the whole sentence), though it is not referenced by any single word or phrase. The reference is derived from the complete context of the relevant sentence part. See examples 5-7 in Table 1.</Paragraph> <Paragraph position="6"> If there is no reference from the sentence to the target word, the annotators were instructed to choose false. In example 9 in Table 1 the target word &quot;HIV-positive&quot; should be treated as a single unit that cannot be broken down; although both the general term &quot;HIV status&quot; and the more specific term &quot;HIV negative&quot; are referred to, the target word itself cannot be understood or derived from the text.
In example 10, although the year 1945 may refer to a specific war, there is no &quot;war&quot; either specifically or generally understood from the text.</Paragraph> </Section> </Section> <Section position="6" start_page="174" end_page="174" type="metho"> <SectionTitle> Table 1 (rows 6-10): ID | TEXT | TARGET | VALUE </SectionTitle> <Paragraph position="0">
6 | Recreational marijuana smokers are no more likely to develop oral cancer than nonusers. | risk | context
7 | A bus ticket cost nowadays 5.2 NIS whereas last year it cost 4.9. | increase | context
8 | Pakistani officials announced that two South African men in their custody had confessed to planning attacks at popular tourist spots in their home country. | house | false
9 | For women who are HIV negative or who do not know their HIV status, breastfeeding should be promoted for six months. | HIV-positive | false
10 | On Feb. 1, 1945, the Polish government made Warsaw its capital, and an office for urban reconstruction was set up. | war | false
</Paragraph> <Paragraph position="1"> We measured the agreement on the lexical reference binary task (in which Word, Phrase and Context are conflated to true). The resulting kappa statistic of 0.63 is regarded as substantial agreement (Landis and Koch, 1997). The resulting dataset is not balanced in terms of true and false examples, and a straw baseline for accuracy is 0.61, representing a system which predicts all examples as true.</Paragraph> <Section position="1" start_page="174" end_page="174" type="sub_section"> <SectionTitle> 3.3 Dataset Analysis </SectionTitle> <Paragraph position="0"> In a similar manner to (Bar-Haim et al., 2005; Vanderwende et al., 2005), we investigated the relationship between lexical reference and textual entailment. We checked the performance of a textual entailment system which relies solely on an ideal lexical reference component that makes no mistakes and asserts that a hypothesis is entailed by a text if and only if all content words in the hypothesis are referenced in the text. Based on the lexical reference dataset annotations, such an &quot;ideal&quot; system would obtain an accuracy of 74% on the corresponding subset of the textual entailment task. The corresponding precision is 68% and the recall is 82%. This is significantly higher than the results of the best performing systems that participated in the challenge on the RTE-1 test set. This suggests that lexical reference is a valuable sub-task for entailment. Interestingly, a similar entailment system based on a lexical reference component which does not account for contextual lexical reference (i.e. all Context annotations are regarded as false) would achieve an accuracy of only 63%, with 41% precision and 63% recall. This suggests that lexical reference in general, and contextual entailment in particular, play an important (though not sufficient) role in entailment recognition.</Paragraph> <Paragraph position="1"> Further, we wanted to investigate the validity of the assumption that for an entailment relationship to hold, all content words in the hypothesis must be referenced by the text. We examined the examples in our dataset which were derived from text-hypothesis pairs that were annotated as true (entailing) in the RTE dataset. Out of 257 such examples only 34 were annotated as false by both annotators. Table 2 lists a few such examples in which entailment as a whole holds but there exists a word in the hypothesis (highlighted in the table) which is not lexically referenced by the text. In many cases, the target word was part of a noncompositional compound in the hypothesis, and therefore should not be expected to be referenced by the text (see examples 1-2). This finding indicates that the basic assumption is a reasonable approximation for entailment. We could not have revealed this fact without the dataset for the subtask of lexical reference.</Paragraph> </Section> </Section>
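The "ideal" system analyzed above reduces to a simple decision rule, sketched below under the assumption that per-word lexical reference judgments are available (from the gold annotations or from a thresholded model score); the function and the toy data are illustrative only.

```python
def entails(hypothesis_content_words, is_referenced):
    # Decision rule of the "ideal" system: the text entails the hypothesis
    # iff every content word of the hypothesis is lexically referenced by it.
    return all(is_referenced(w) for w in hypothesis_content_words)

# Toy usage with hypothetical per-word judgments for "a dog bit a man":
judgments = {"dog": True, "bite": True, "man": True}
print(entails(judgments, judgments.get))  # True
```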
<Section position="7" start_page="174" end_page="175" type="metho"> <SectionTitle> 4 Lexical Reference Models </SectionTitle> <Paragraph position="0"> The lexical reference dataset facilitates qualitative and quantitative comparison of various lexical models. This section describes four state-of-the-art models that can be applied to the lexical reference task. The performance of these models was tested and analyzed, as described in the next section, using the lexical reference dataset. All models assign a [0,1] score to a given pair of text t and target word u, which can be interpreted as the confidence that u is lexically referenced in t.</Paragraph> </Section> <Section position="8" start_page="175" end_page="176" type="metho"> <SectionTitle> Table 2 (rows 3-4) </SectionTitle> <Paragraph position="0"> 3 | TEXT: The Securities and Exchange Commission's new rule to beef up the independence of mutual fund boards represents an industry defeat. | HYPOTHESIS: The SEC's new rule will give boards independence. | true | false </Paragraph> <Paragraph position="1"> 4 | TEXT: Texas Data Recovery is also successful at retrieving lost data from notebooks and laptops, regardless of age, make or model. | HYPOTHESIS: In the event of a disaster you could use Texas Data Recovery and you will have the capability to restore lost data. | true | false </Paragraph> <Paragraph position="2"> The target word is shown in bold.</Paragraph> <Section position="1" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 4.1 WordNet </SectionTitle> <Paragraph position="0"> Following the common practice in NLP applications (see Section 2.1), we evaluated the performance of a straightforward utilization of WordNet's lexical information. Our wordnet model first lemmatizes the text and the target word. It then assigns a score of 1 if the text contains a synonym, hyponym or derived form of the target word, and a score of 0 otherwise.</Paragraph> </Section> <Section position="2" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 4.2 Similarity </SectionTitle> <Paragraph position="0"> As a second measure we used the distributional similarity measure of (Lin, 1998). For a text t and a word u we assign the maximal similarity score over the words of the text, score(t, u) = max_{v in t} sim(u, v), where sim(u, v) is the similarity score for u and v4.</Paragraph> </Section> <Section position="3" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 4.3 Alignment model </SectionTitle> <Paragraph position="0"> (Glickman et al., 2006) was among the top scoring systems in the RTE-1 challenge and supplies a probabilistically motivated lexical measure based on word co-occurrence statistics. It is defined for a text t and a word u in terms of P(u|v), which is simply the co-occurrence probability: the probability that a sentence containing v also contains u. The co-occurrence statistics were collected from the Reuters Corpus Volume 1.</Paragraph> </Section>
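As an illustration of the alignment measure, here is a minimal sketch assuming that the score of a target word u against a text is the maximum co-occurrence probability P(u|v) over the text words v, with P(u|v) estimated by simple sentence-level counting; the counting scheme and function names are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import permutations

def train_cooccurrence(sentences):
    # Estimate P(u|v) = #sentences containing both u and v / #sentences containing v
    # from an iterable of tokenized sentences.
    sent_count = Counter()   # number of sentences containing v
    pair_count = Counter()   # number of sentences containing both u and v (ordered pairs)
    for sent in sentences:
        vocab = set(w.lower() for w in sent)
        sent_count.update(vocab)
        for u, v in permutations(vocab, 2):
            pair_count[(u, v)] += 1
    def p_u_given_v(u, v):
        return pair_count[(u, v)] / sent_count[v] if sent_count[v] else 0.0
    return p_u_given_v

def alignment_score(text_tokens, u, p_u_given_v):
    # Alignment-style lexical reference score: max over text words v of P(u|v).
    u = u.lower()
    return max((p_u_given_v(u, v.lower()) for v in text_tokens), default=0.0)
```

With co-occurrence statistics gathered from a large sentence collection (the paper uses Reuters Corpus Volume 1), alignment_score returns the [0,1] confidence described at the start of Section 4.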
<Section position="4" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 4.4 Bayesian model </SectionTitle> <Paragraph position="0"> (Glickman et al., 2005) provide a contextual measure which takes into account the whole context of the text rather than a single word in the text, as the previous models do. This is the only model which addresses contextual reference rather than just word-to-word matching. The model is based on a Naive Bayes text classification approach in which corpus sentences serve as documents and the class is the reference of the target word u. Sentences containing the word u are used as positive examples while all other sentences are considered negative examples. For a text t and a word u the score is computed from the following quantities: n(w,t), the number of times word w appears in t; P(u), the probability that a sentence contains the word u; and P(v|!u), the probability that a sentence NOT containing u contains v. In order to reduce data size and to account for zero probabilities we applied smoothing and information-gain-based feature selection on the data prior to running the model. The co-occurrence probabilities were collected from sentences of the Reuters corpus in a similar manner to the alignment model.</Paragraph> </Section> <Section position="5" start_page="175" end_page="176" type="sub_section"> <SectionTitle> 4.5 Combined Model </SectionTitle> <Paragraph position="0"> The WordNet and Bayesian models are derived from quite different motivations. One would expect the WordNet model to be better at identifying the word-to-word explicit reference examples, while the Bayesian model is expected to capture the contextually implied references. For this reason we tried to combine forces by evaluating a naive linear interpolation of the two models (simply averaging the scores of the two models). This model has not been previously suggested and, to the best of our knowledge, this type of combination is novel.</Paragraph> </Section> </Section>
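The contextual model above can be illustrated with the following sketch. It treats the presence of the target word u in a sentence as the class of a Naive Bayes classifier over the words of the text; add-one smoothing stands in for the smoothing the authors mention, and the exact scoring formula of the paper is not reproduced, so this is an illustration of the approach rather than the authors' implementation. The simple score averaging of the combined model (Section 4.5) is included at the end.

```python
import math
from collections import Counter

def train_bayes(sentences, u):
    # Gather the statistics described above for target word u: P(u), P(v|u)
    # and P(v|!u), with add-one smoothing (an assumption of this sketch).
    # Assumes u occurs in at least one sentence and is absent from at least one.
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for sent in sentences:
        words = set(w.lower() for w in sent)
        if u in words:
            n_pos += 1
            pos.update(words)
        else:
            n_neg += 1
            neg.update(words)
    vocab_size = len(set(pos) | set(neg)) or 1
    return {
        "p_u": n_pos / (n_pos + n_neg),
        "p_v_pos": lambda v: (pos[v] + 1) / (n_pos + vocab_size),
        "p_v_neg": lambda v: (neg[v] + 1) / (n_neg + vocab_size),
    }

def bayes_score(text_tokens, model):
    # Naive Bayes posterior (computed in log space) that the text refers to u.
    counts = Counter(w.lower() for w in text_tokens)     # counts[v] = n(v, t)
    log_pos = math.log(model["p_u"])
    log_neg = math.log(1.0 - model["p_u"])
    for v, n in counts.items():
        log_pos += n * math.log(model["p_v_pos"](v))
        log_neg += n * math.log(model["p_v_neg"](v))
    diff = log_neg - log_pos
    if diff > 50:          # avoid overflow when the negative class dominates
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

def combined_score(wordnet_score, bayes_value):
    # Combined model of Section 4.5: simple average of the two model scores.
    return (wordnet_score + bayes_value) / 2.0
```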
<Section position="10" start_page="176" end_page="177" type="metho"> <SectionTitle> 5 Empirical Evaluation and Analysis 5.1 Results </SectionTitle> <Paragraph position="0"> In order to evaluate the scores produced by the various models as a potential component in an entailment system, we compared their recall-precision graphs. In addition we compared average precision, a single-number measure equivalent to the area under an uninterpolated recall-precision curve, which is commonly used to evaluate a system's ranking ability (Voorhees and Harman, 1999). On our dataset an average precision greater than 0.65 is better than chance at the 0.05 level and an average precision greater than 0.66 is significant at the 0.01 level.</Paragraph> <Paragraph position="1"> Figure 1 compares the average precision and recall-precision results for the various models. As can be seen, the combined wordnet+bayes model performs best. In terms of average precision, the similarity and wordnet models are comparable and are slightly better than bayes. The alignment model, however, is not significantly better than random guessing. The recall-precision figure indicates that the Bayesian model succeeds in ranking quite well both within the positively scored wordnet examples and within the negatively scored wordnet examples, thus resulting in the improved average precision of the combined model. A better understanding of the systems' performance emerges from the following analysis.</Paragraph> <Section position="1" start_page="176" end_page="176" type="sub_section"> <SectionTitle> 5.2 Analysis </SectionTitle> <Paragraph position="0"> Table 3 lists a few examples from the lexical reference dataset along with their gold-standard annotation and the Bayesian model score. Manual inspection of the data shows that the Bayesian model commonly assigns a low score to correct examples which have an entailing trigger word or phrase in the sentence but whose context as a whole is not typical for the hypothesized entailed target word. For example, in example 5 the entailing phrase 'set in place' and in example 6 the entailing word 'founder' do appear in the text; however, the contexts of the sentences are not typical news-domain contexts of issued or founded. An interesting direction for future work would be to change the generative story and model to account for such cases.</Paragraph> <Paragraph position="1"> The WordNet model identified a matching word in the text for 99 out of the 580 examples. This corresponds to a somewhat low recall of 25% and a quite high precision of 90%. Table 4 lists typical mistakes of the wordnet model. Examples 1-3 are false positives in which there is a word in the text (emphasized in the table) which is a synonym or hyponym of the target word for some sense in WordNet, but which does not appear in that sense in the context of the text. Examples 4-6 show false negatives, in which the annotators identified a trigger word in the text (emphasized in the table), yet neither it nor any other word in the text is a synonym or hyponym of the target word.</Paragraph> </Section> <Section position="2" start_page="176" end_page="177" type="sub_section"> <SectionTitle> 5.3 Subcategory analysis </SectionTitle> <Paragraph position="0"> As seen above, the combined model outperforms the others since it identifies both word-to-word lexical reference and context-to-word lexical reference. These are quite different cases. We asked the annotators to state the subcategory when they annotated an example as true (as described in the annotation guidelines in Section 3.2.1). The Word subcategory corresponds to a word-to-word match, while the Phrase and Context subcategories correspond to matches involving more than a single word. As can be expected, the agreement on such a task resulted in a lower kappa of 0.5, which corresponds to moderate agreement (Landis and Koch, 1997). The confusion matrix between the two annotators is presented in Table 5. This decomposition enables the evaluation of the strengths and weaknesses of different lexical reference modules, free from the context of the bigger entailment system.</Paragraph> <Paragraph position="1"> We used the subcategories dataset to test the performance of the different models. Table 6 lists, for each subcategory, the recall of correctly identified examples at each model's 25% recall level. The table shows that the wordnet and similarity models' strength is in identifying examples where lexical reference is triggered by a dominant word in the sentence. The bayes model, however,</Paragraph> <Paragraph position="2"> [Figure 1: comparison of average precision (left) and recall-precision (right) results for the various models] </Paragraph> <Paragraph position="3"> [Table 3 column headers: id | text | token | annotation | score] </Paragraph> </Section> </Section> </Paper>