<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1014"> <Title>The Distributional Inclusion Hypotheses and Lexical Entailment</Title>
<Section position="4" start_page="108" end_page="109" type="metho"> <SectionTitle> 3 The Distributional Inclusion Hypotheses </SectionTitle>
<Paragraph position="0"> In this paper we suggest refined versions of the distributional similarity hypothesis which relate distributional behavior with lexical entailment.</Paragraph>
<Paragraph position="1"> Extending the rationale of Weeds et al., we suggest that if the meaning of a word v entails another word w then it is expected that all the typical contexts (features) of v will occur also with w. That is, the characteristic contexts of v are expected to be included within all w's contexts (but not necessarily amongst the most characteristic ones for w). Conversely, we might expect that if v's characteristic contexts are included within all w's contexts then it is likely that the meaning of v does entail w. Taking both directions together, lexical entailment is expected to correlate highly with characteristic feature inclusion.</Paragraph>
<Paragraph position="2"> Two additional observations are needed before concretely formulating these hypotheses. As explained in Section 2, word contexts should be represented by syntactic features, which are more restrictive and thus better reflect the restrained semantic meaning of the word (it is difficult to tie entailment to looser context representations, such as co-occurrence in a text window). We also notice that distributional similarity principles are intended to hold at the sense level rather than the word level, since different senses have different characteristic contexts (even though common computational practice is to work at the word level, due to the lack of robust sense annotation).</Paragraph>
<Paragraph position="3"> We can now define the two distributional inclusion hypotheses, which correspond to the two directions of inference relating distributional feature inclusion and lexical entailment. Let vi and wj be two word senses of the words v and w, respectively, and let vi => wj denote the (directional) entailment relation between these senses. Assume further that we have a measure that determines the set of characteristic features for the meaning of each word sense. Then we would hypothesize:</Paragraph>
<Section position="1" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Hypothesis I: </SectionTitle> <Paragraph position="0"> If vi => wj then all the characteristic (syntactic-based) features of vi are expected to appear with wj.</Paragraph> </Section>
<Section position="2" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Hypothesis II: </SectionTitle> <Paragraph position="0"> If all the characteristic (syntactic-based) features of vi appear with wj then we expect that vi => wj.</Paragraph> </Section> </Section>
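Restated compactly, the two hypotheses read as follows. This rendering is an editorial gloss rather than notation from the paper: F(x) stands for the characteristic (syntactic-based) feature set of sense x, and F_all(x) for the set of all features observed with x.

```latex
% Editorial gloss of Hypotheses I and II; F and F_all are not the paper's notation.
% Hypothesis I: entailment implies inclusion of the characteristic features.
v_i \Rightarrow w_j \;\longrightarrow\; F(v_i) \subseteq F_{\mathrm{all}}(w_j)
% Hypothesis II: inclusion of the characteristic features suggests entailment.
F(v_i) \subseteq F_{\mathrm{all}}(w_j) \;\longrightarrow\; v_i \Rightarrow w_j
```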
<Section position="5" start_page="108" end_page="109" type="metho"> <SectionTitle> Footnote 3 </SectionTitle> <Paragraph position="0"> Since the original data set did not include the direction of entailment, we have enriched it by adding the judgments of entailment direction.</Paragraph> </Section>
<Section position="6" start_page="109" end_page="110" type="metho"> <SectionTitle> 4 Word Level Testing of Feature Inclusion </SectionTitle>
<Paragraph position="0"> To check the validity of the hypotheses we need to test feature inclusion. In this section we present an automated word-level feature inclusion testing method, termed ITA (Inclusion Testing Algorithm).</Paragraph>
<Paragraph position="1"> To overcome the data sparseness problem we incorporated web-based feature sampling. Given a test pair of words, three main steps are performed, as detailed in the following subsections: Step 1: Corpus-based generation of the set of characteristic features for each word.</Paragraph>
<Paragraph position="2"> Step 2: Testing feature inclusion for each pair, in both directions, within the given corpus data. Step 3: Complementary testing of feature inclusion for each pair on the web.</Paragraph>
<Section position="1" start_page="109" end_page="109" type="sub_section"> <SectionTitle> 4.1 Step 1: Corpus-based generation of characteristic features </SectionTitle> <Paragraph position="0"> To implement the first step of the algorithm, the RFF weighting function is exploited and its top-100 weighted features are taken as most characteristic for each word. As mentioned in Section 2, Geffet and Dagan (2004) show that RFF yields a high concentration of good features at the top of the vector.</Paragraph> </Section>
<Section position="2" start_page="109" end_page="109" type="sub_section"> <SectionTitle> 4.2 Step 2: Corpus-based feature inclusion test </SectionTitle> <Paragraph position="0"> We first check feature inclusion in the corpus that was used to generate the characteristic feature sets. For each word pair (w, v) we first determine which features of w do co-occur with v in the corpus. The same is done to identify features of v that co-occur with w in the corpus.</Paragraph> </Section>
<Section position="3" start_page="109" end_page="110" type="sub_section"> <SectionTitle> 4.3 Step 3: Complementary Web-based Inclusion Test </SectionTitle> <Paragraph position="0"> This step is most important to avoid inclusion misses due to the data sparseness of the corpus. A few recent works (Ravichandran and Hovy, 2002; Keller et al., 2002; Chklovski and Pantel, 2004) used the web to collect statistics on word co-occurrences. In a similar spirit, our inclusion test is completed by searching the web for the missing (non-included) features on both sides. We call this web-based technique mutual web-sampling. The web results are further parsed to verify matching of the feature's syntactic relationship.</Paragraph>
<Paragraph position="1"> We denote the subset of w's features that are missing for v as M(w, v) (and equivalently M(v, w)). Since web sampling is time consuming we randomly sample a subset of k features (k=20 in our experiments), denoted M(w, v, k) and M(v, w, k), respectively.</Paragraph>
<Paragraph position="2"> Mutual Web-sampling Procedure: For each pair (w, v) and their k-subsets M(w, v, k) and M(v, w, k) execute:</Paragraph> </Section> </Section>
<Section position="7" start_page="110" end_page="110" type="metho"> <SectionTitle> 1. Syntactic Filtering of &quot;Bag-of-Words&quot; Search: </SectionTitle> <Paragraph position="0"> Search the web for sentences including v and a feature f from M(w, v, k) as a &quot;bag of words&quot;, i.e. sentences where v and f appear at any distance and in either order. Then filter out the sentences that do not match the defined syntactic relation between f and v (based on parsing). Features that co-occur with v in the correct syntactic relation are removed from M(w, v, k). Do the same search and filtering for w and features from M(v, w, k).</Paragraph> </Section>
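A minimal sketch of the bag-of-words stage above, for illustration only: the paper does not specify a search engine or parser, so the web search (search_sentences) and the syntactic check (holds_relation) are passed in as callables, and the representation of features as (head, relation) pairs is likewise an assumption.

```python
import random

K = 20  # size of the sampled subset M(w, v, k), as in the paper

def sample_missing(missing, k=K):
    """Randomly sample at most k features from the missing-feature set M(w, v)."""
    missing = sorted(missing)
    return set(random.sample(missing, min(k, len(missing))))

def bag_of_words_filter(target_word, missing_k, search_sentences, holds_relation):
    """Stage 1: drop from missing_k every feature found with target_word on the
    web in the required syntactic relation (parsing-based filtering).
    search_sentences(terms) -> iterable of sentences containing all terms;
    holds_relation(sentence, word, head, relation) -> bool (syntactic check)."""
    still_missing = set()
    for head, relation in missing_k:
        found = any(holds_relation(sentence, target_word, head, relation)
                    for sentence in search_sentences([target_word, head]))
        if not found:
            still_missing.add((head, relation))  # passed on to stage 2
    return still_missing
```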
<Section position="8" start_page="110" end_page="110" type="metho"> <SectionTitle> 2. Syntactic Filtering of &quot;Exact String&quot; Matching: </SectionTitle> <Paragraph position="0"> For the missing features on both sides (those left in M(w, v, k) and M(v, w, k) after stage 1), apply an &quot;exact string&quot; search of the web. For this, convert the tuple (v, f) to a string by adding prepositions and articles where needed. For example, for (element, <project, pcomp_of, 1>) generate the corresponding string &quot;element of the project&quot; and search the web for exact matches of the string. Then validate the syntactic relationship of f and v in the extracted sentences. Remove the found features from M(w, v, k) and M(v, w, k), respectively.</Paragraph> </Section>
<Section position="9" start_page="110" end_page="110" type="metho"> <SectionTitle> 3. Missing Features Validation: </SectionTitle> <Paragraph position="0"> Since some of the features may be too infrequent or corpus-biased, check whether the remaining missing features do co-occur on the web with their original target words (with which they did occur in the corpus data). Otherwise, they should not be considered valid misses and are also removed from M(w, v, k) and M(v, w, k).</Paragraph>
<Paragraph position="1"> Output: Inclusion in either direction holds if the corresponding set of missing features is now empty.</Paragraph>
<Paragraph position="2"> We also experimented with features consisting of words without syntactic relations, i.e. plain exact-string or bag-of-words matching. However, almost all the words (also non-entailing ones) were found with all the features of each other, even for semantically implausible combinations (e.g. a word and a feature appear next to each other but belong to different clauses of the sentence). Therefore we conclude that syntactic relation validation is very important, especially on the web, in order to avoid coincidental co-occurrences.</Paragraph> </Section>
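The string construction in stage 2 can be illustrated with a short sketch. It is an editorial example, not the authors' implementation: the feature encoding and the handling of prepositions and articles below are assumptions, guided only by the "element of the project" example above.

```python
def feature_to_exact_string(word, feature):
    """Turn (word, feature) into a quoted query string, e.g.
    ('element', ('project', 'pcomp_of')) -> '"element of the project"'."""
    head, relation = feature
    if relation == 'pcomp_of':      # prepositional complement with "of"
        return f'"{word} of the {head}"'
    if relation == 'nn':            # noun-compound modifier, e.g. "capital flow"
        return f'"{word} {head}"'
    return f'"{word} {head}"'       # fallback: plain adjacency

print(feature_to_exact_string('element', ('project', 'pcomp_of')))  # "element of the project"
```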
<Section position="10" start_page="110" end_page="112" type="metho"> <SectionTitle> 5 Empirical Results </SectionTitle>
<Paragraph position="0"> To test the validity of the distributional inclusion hypotheses we performed an empirical analysis on a selected test sample using our automated testing procedure.</Paragraph>
<Section position="1" start_page="110" end_page="110" type="sub_section"> <SectionTitle> 5.1 Data and setting </SectionTitle>
<Paragraph position="0"> We experimented with a randomly picked test sample of about 200 noun pairs out of the 1,200 pairs produced by RFF (for details see Geffet and Dagan, 2004) under Lin's similarity scheme (Lin, 1998).</Paragraph>
<Paragraph position="1"> The word pairs were judged by the lexical entailment criterion (as described in Section 2). The original percentage of correct (52%) and incorrect (48%) entailments was preserved.</Paragraph>
<Paragraph position="2"> To estimate the degree of validity of the distributional inclusion hypotheses we decomposed each word pair (w, v) of the sample into two directional pairs ordered by potential entailment direction: (w, v) and (v, w). The 400 resulting ordered pairs are used as a test set in Sections 5.2 and 5.3.</Paragraph>
<Paragraph position="3"> Features were computed from co-occurrences in a subset of the Reuters corpus of about 18 million words. For the web feature sampling, the maximal number of web samples for each query (a word plus a feature) was set to 3,000 sentences.</Paragraph> </Section>
<Section position="2" start_page="110" end_page="111" type="sub_section"> <SectionTitle> 5.2 Automatic Testing of the Validity of the Hypotheses at the Word Level </SectionTitle>
<Paragraph position="0"> The test set of 400 ordered pairs was examined in terms of entailment (according to the manual judgment) and feature inclusion (according to the ITA algorithm), as shown in Table 2.</Paragraph>
<Paragraph position="1"> According to Hypothesis I we expect that a pair (w, v) that satisfies entailment will also preserve feature inclusion. Conversely, by Hypothesis II, if all the features of w are included by v then we expect that w entails v.</Paragraph>
<Paragraph position="2"> We observed that Hypothesis I is better attested by our data than the second hypothesis: 86% (97 out of 113) of the entailing pairs fulfilled the inclusion condition, while Hypothesis II holds for approximately 70% (97 of 139) of the pairs for which feature inclusion holds. In the next section we analyze the cases of violation of both hypotheses and find that the first hypothesis held to an almost perfect extent with respect to word senses.</Paragraph>
<Paragraph position="3"> It is also interesting to note that, thanks to the web-sampling procedure, over 90% of the features not included in the corpus were found on the web, while most of the features still missing after the web search are indeed semantically implausible.</Paragraph> </Section>
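For concreteness, the word-level rates just cited follow directly from the reported counts (97 of the 113 entailing ordered pairs satisfy inclusion, and inclusion holds for 139 ordered pairs in total):

```python
entailing = 113   # ordered pairs judged entailing
inclusion = 139   # ordered pairs satisfying feature inclusion (per ITA)
both = 97         # pairs satisfying both

print(f"Hypothesis I  (inclusion given entailment): {both / entailing:.0%}")  # -> 86%
print(f"Hypothesis II (entailment given inclusion): {both / inclusion:.0%}")  # -> 70%
```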
<Section position="3" start_page="111" end_page="112" type="sub_section"> <SectionTitle> 5.3 Manual Sense Level Testing of Hypotheses Validity </SectionTitle>
<Paragraph position="0"> Since our data was not sense tagged, the automatic validation procedure could only test the hypotheses at the word level. In this section our goal is to analyze the findings of our empirical test at the word sense level, as our hypotheses were defined for senses. Basically, two cases of hypothesis invalidity were detected: entailing pairs that fail the feature inclusion test (violating Hypothesis I), and non-entailing pairs that pass it (violating Hypothesis II). At the word level we observed 14% invalid pairs of the first case and 30% of the second case. However, our manual analysis shows that over 90% of the first-case pairs were due to a different sense of one of the entailing words, e.g. capital - town (capital as money) and spread - gap (spread as distribution) (Table 3). Note that ambiguity of the entailed word does not cause errors (like town - area, area as domain) (Table 3). Thus the first hypothesis holds at the sense level for over 98% of the cases (Table 4).</Paragraph>
<Paragraph position="1"> The two remaining invalid instances of the first case were due to limitations of the web sampling method and syntactic parsing filtering mistakes, especially for some less characteristic and infrequent features captured by RFF. Thus, in virtually all the examples tested in our experiment Hypothesis I was valid.</Paragraph>
<Paragraph position="2"> We also explored the second case of invalid pairs: non-entailing words that pass the feature inclusion test. After sense-based analysis their percentage was reduced slightly to 27.4%. Three possible reasons were discovered. First, there are words with features typical to the general meaning of the domain, which tend to be included by many other words of this domain, like valley - town. The features of valley (&quot;eastern valley&quot;, &quot;central valley&quot;, &quot;attack in valley&quot;, &quot;industry of the valley&quot;) are not discriminative enough to be distinguished from town, as they are all characteristic of any geographic location.</Paragraph>
<Paragraph position="3"> [Table 4 caption fragment] entailing ordered pairs that hold/do not hold feature inclusion at the sense level.</Paragraph>
<Paragraph position="4"> [Table 3 examples] spread - gap (mutually entail each other); <weapon, pcomp_of>: The Committee was discussing the Programme of the &quot;Big Eight,&quot; aimed against spread of weapon of mass destruction. town - area (&quot;town&quot; entails &quot;area&quot;); <cooperation, pcomp_for>: This is a promising area for cooperation and exchange of experiences. capital - town (&quot;capital&quot; entails &quot;town&quot;); <flow, nn>: Offshore financial centers affect cross-border capital flow in China. [Table 3 caption fragment] related words, where the disjoint features belong to a different sense of the word.</Paragraph>
<Paragraph position="5"> The second group consists of words that can be entailing, but only in a context-dependent (anaphoric) manner rather than ontologically. For example, government and neighbour, where neighbour is used in the meaning of &quot;neighbouring (country) government&quot;. Finally, sometimes one or both of the words are sufficiently abstract, general and ambiguous to appear with a wide range of features on the web, like element (violence - element, with all the tested features of violence included by element).</Paragraph>
<Paragraph position="6"> To prevent occurrences of the second case, more characteristic and discriminative features should be provided. For this purpose, features extracted from the web (which, unlike corpus features, are not domain-biased) and multi-word features may be helpful. Overall, though, there might be inherent cases that invalidate Hypothesis II.</Paragraph> </Section> </Section>
<Section position="11" start_page="112" end_page="112" type="metho"> <SectionTitle> 6 Improving Lexical Entailment Prediction by ITA (Inclusion Testing Algorithm) </SectionTitle>
<Paragraph position="0"> In this section we show that ITA can be practically used to improve the (non-directional) lexical entailment prediction task described in Section 2. Given the output of the distributional similarity method, we employ ITA at the word level to filter out non-entailing pairs. Word pairs that satisfy feature inclusion of all k features (at least in one direction) are claimed as entailing.</Paragraph>
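A minimal sketch of this filtering rule, for illustration; ita_missing is a hypothetical mapping from each ordered pair to the set of sampled features still missing after the mutual web-sampling procedure.

```python
def filter_entailing(pairs, ita_missing):
    """Keep a (w, v) pair iff all k sampled features are included in at least
    one direction, i.e. one of its missing-feature sets is empty."""
    return [(w, v) for (w, v) in pairs
            if not ita_missing[(w, v)] or not ita_missing[(v, w)]]

# Toy illustration (made-up missing sets): the pair passes in one direction.
toy = {("capital", "town"): set(), ("town", "capital"): {("flow", "nn")}}
print(filter_entailing([("capital", "town")], toy))  # [('capital', 'town')]
```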
<Paragraph position="1"> The same test sample of 200 word pairs mentioned in Section 5.1 was used in this experiment. The results were compared to RFF under Lin's similarity scheme (RFF-top-40 in Table 5).</Paragraph>
<Paragraph position="2"> Precision was significantly improved, as ITA filters out 60% of the incorrect pairs. On the other hand, relative recall (taking the RFF-top-40 output as 100% recall) was reduced by only 13%, consequently leading to a better relative F1 (Table 5).</Paragraph>
<Paragraph position="3"> Since our method removes about 35% of the original top-40 RFF output, it was interesting to compare our results to simply cutting off the 35% lowest ranked RFF words (top-26). The comparison to this baseline (RFF-top-26 in Table 5) showed that ITA filters the output much better than just cutting off the lowest ranking similarities. We also tried a couple of variations on feature sampling for the web-based procedure. In one of our preliminary experiments we used the top-k RFF features instead of random selection, but we observed that top-ranked RFF features are less discriminative than randomly selected ones, due to the nature of the RFF weighting strategy, which promotes features shared by many similar words. We then attempted doubling the sampling to 40 random features. As expected, recall decreased slightly, while precision increased by over 5%. In summary, the behavior of ITA with sampling of k=20 and k=40 features is closely comparable (ITA-20 and ITA-40 in Table 5, respectively)4.</Paragraph> </Section> </Paper>