<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1014"> <Title>The Distributional Inclusion Hypotheses and Lexical Entailment</Title> <Section position="3" start_page="107" end_page="108" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="107" end_page="108" type="sub_section"> <SectionTitle> 2.1 Implementations of Distributional Similarity </SectionTitle> <Paragraph position="0"> This subsection reviews the relevant details of the earlier methods used in this paper.</Paragraph> <Paragraph position="1"> In the computational setting, the contexts of words are represented by feature vectors. Each word w is represented by a feature vector, where an entry in the vector corresponds to a feature f. Each feature represents another word (or term) with which w co-occurs, and possibly also specifies the syntactic relation between the two words, as in (Grefenstette, 1994; Lin, 1998; Weeds and Weir, 2003). Pado and Lapata (2003) demonstrated that syntactic dependency-based vector space models can help distinguish among classes of different lexical relations, which seems to be more difficult for traditional &quot;bag of words&quot; co-occurrence-based models. A syntactic feature is defined as a triple <term, syntactic_relation, relation_direction> (the direction is set to 1 if the feature is the word's modifier, and to 0 otherwise). For example, given the word &quot;company&quot;, the feature <earnings_report, gen, 0> (genitive) corresponds to the phrase &quot;company's earnings report&quot;, and <profit, pcomp, 0> (prepositional complement) corresponds to &quot;the profit of the company&quot;. 
Throughout this paper we used syntactic features generated by the Minipar dependency parser (Lin, 1993).</Paragraph> <Paragraph position="2"> The value of each entry in the feature vector is determined by some weight function weight(w,f), which quantifies the degree of statistical association between the feature and the corresponding word. The most widely used association weight function is (point-wise) Mutual Information (MI) (Church and Hanks, 1990; Lin, 1998; Dagan, 2000; Weeds et al., 2004).</Paragraph> <Paragraph position="3"> element <=> component; gap <=> spread; town * airport; loan <= mortgage</Paragraph> <Paragraph position="5"> Table 1: A sample of word pairs produced by the RFF-based method of (Geffet and Dagan, 2004). Entailment judgments are marked by the arrow direction, with '*' denoting no entailment.</Paragraph> <Paragraph position="6"> Once feature vectors have been constructed, the similarity between two words is defined by some vector similarity metric. Different metrics have been used, such as weighted Jaccard (Grefenstette, 1994; Dagan, 2000), cosine (Ruge, 1992), various information theoretic measures (Lee, 1997), and the widely cited and competitive (see (Weeds and Weir, 2003)) measure of Lin (1998) for the similarity between two words, w and v, defined as follows: sim(w,v) = Σ_{f ∈ F(w)∩F(v)} (weight(w,f) + weight(v,f)) / (Σ_{f ∈ F(w)} weight(w,f) + Σ_{f ∈ F(v)} weight(v,f)), where F(w) and F(v) are the active features of the two words (positive feature weight) and the weight function is defined as MI. As is typical for vector similarity measures, it assigns high similarity scores if many of the two words' features overlap, even though some prominent features might be disjoint. This is a major reason for obtaining semantically loose similarities, like company - government and country - economy.</Paragraph> <Paragraph position="7"> Investigating the output of Lin's (1998) similarity measure with respect to the above criterion in (Geffet and Dagan, 2004), we discovered that the quality of similarity scores is often hurt by inaccurate feature weights, which yield rather noisy feature vectors. 
Hence, we tried to improve the feature weighting function to promote those features that are most indicative of the word meaning. A new weighting scheme, termed RFF (Relative Feature Focus), was defined for bootstrapping feature weights. First, basic similarities are generated by Lin's measure. Then, feature weights are recalculated, boosting the weights of features that characterize many of the words that are most similar to the given one. As a result, the most prominent features of a word are concentrated within the top-100 entries of the vector. Finally, word similarities are recalculated by Lin's metric over the vectors with the new RFF weights.</Paragraph> <Paragraph position="8"> In concrete terms, RFF is defined by RFF(w,f) = Σ_{v ∈ WS(f) ∩ N(w)} sim(w,v), where sim(w,v) is an initial approximation of the similarity space by Lin's measure, WS(f) is the set of words co-occurring with feature f, and N(w) is the set of the most similar words of w by Lin's measure.</Paragraph> <Paragraph position="9"> The lexical entailment prediction task of (Geffet and Dagan, 2004) measures how many of the top-ranking similarity pairs produced by the RFF-based metric hold the entailment relation, in at least one direction. To this end, a data set of 1,200 pairs was created, consisting of the top-N (N=40) similar words of 30 randomly selected nouns, which were manually judged by the lexical entailment criterion. Quite high Kappa agreement values of 0.75 and 0.83 were reported, indicating that the entailment judgment task was reasonably well defined. A subset of the data set is shown in Table 1.</Paragraph> <Paragraph position="10"> The RFF weighting produced a 10% precision improvement over Lin's original use of MI, suggesting RFF's capability to promote semantically meaningful features. However, over 47% of the word pairs in the top-40 similarities are not related by entailment, which calls for further improvement. 
In this paper we use the same data set and the RFF metric as the basis for our experiments.</Paragraph> </Section> <Section position="2" start_page="108" end_page="108" type="sub_section"> <SectionTitle> 2.2 Predicting Semantic Inclusion </SectionTitle> <Paragraph position="0"> Weeds et al. (2004) attempted to refine the distributional similarity goal to predict whether one term is a generalization/specification of the other.</Paragraph> <Paragraph position="1"> They present a distributional generality concept and expect it to correlate with semantic generality.</Paragraph> <Paragraph position="2"> Their conjecture is that the majority of the features of the more specific word are included in the features of the more general one. They define the feature recall of w with respect to v as the weighted proportion of features of v that also appear in the vector of w. Then, they suggest that a hypernym would have a higher feature recall for its hyponyms (specifications) than vice versa.</Paragraph> <Paragraph position="3"> However, their results in predicting the hyponymy-hypernymy direction (71% precision) are comparable to the naive baseline (70% precision) that simply assumes that general words are more frequent than specific ones. Possible sources of noise in their experiment include ignoring word polysemy and the sparseness of word-feature co-occurrences in the corpus.</Paragraph> </Section> </Section> </Paper>