<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1010"> <Title>Acquiring Hyponymy Relations from Web Documents</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Acquisition Algorithm </SectionTitle> <Paragraph position="0"> Our acquisition algorithm consists of four steps, as explained in this section.</Paragraph> <Paragraph position="1"> Step 1 Extraction of hyponym candidates from itemized expressions in HTML documents.</Paragraph> <Paragraph position="2"> Step 2 Selection of a hypernym candidate with respect to df and idf.</Paragraph> <Paragraph position="3"> Step 3 Ranking of hypernym candidates and HCSs based on semantic similarities between hypernym and hyponym candidates.</Paragraph> <Paragraph position="4"> Step 4 Application of other heuristic rules to eliminate or modify erroneous pairs of a hypernym candidate and an HCS.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Step 1: Extraction of hyponym candidates </SectionTitle> <Paragraph position="0"> The objective of Step 1 is to extract an HCS, which is a set of hyponym candidates that may have a common hypernym, from the itemizations or lists in HTML documents.</Paragraph> <Paragraph position="1"> Many methods can be used to do this. Our approach is a simple one. Each expression in an HTML document can be associated with a path, which specifies both the HTML tags that enclose the expression and the order of the tags. Consider the HTML document in Figure 2. The expression &quot;Car Specification&quot; is enclosed by the tags <LI>,</LI> and <UL>,</UL>. If we sort these tags according to their nesting order, we obtain a path (UL, LI), which specifies where the expression appears. We write <(UL, LI),Car Specification> if (UL, LI) is a path for the expression &quot;Car Specification&quot;. 
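The path computation just described can be sketched as follows; a minimal illustration in Python using the standard html.parser (the class and variable names are ours, not from the paper):

```python
# Sketch of Step 1: group expressions by their HTML tag path.
from collections import defaultdict
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                    # current nesting of open tags
        self.by_path = defaultdict(list)   # path -> expressions (an HCS candidate)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag.upper():
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_path[tuple(self.stack)].append(text)

doc = """
<UL><LI>Car Specification</LI>
  <UL><LI>Toyota</LI><LI>Honda</LI><LI>Nissan</LI></UL>
</UL>
"""
p = PathCollector()
p.feed(doc)
print(p.by_path[("UL", "UL", "LI")])  # → ['Toyota', 'Honda', 'Nissan']
```

The expressions sharing the path (UL, UL, LI) form one HCS, mirroring the {Toyota, Honda, Nissan} example in the text.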
We can then obtain the following paths for the expressions from the document.</Paragraph> <Paragraph position="2"> <(UL, LI),Car Specification> , <(UL, UL, LI),Toyota> , <(UL, UL, LI),Honda> , <(UL, UL, LI),Nissan> Basically, our method extracts the set of expressions associated with the same path as an HCS. In the above example, we can obtain the HCS {Toyota,Honda,***}.</Paragraph> <Paragraph position="3"> We extract an itemization only when its size n satisfies 3 < n < 20. This is because the processing of large itemizations (particularly the downloading of the related documents) is time-consuming, and small itemizations are often used to obtain a proper layout in HTML documents. In addition, during the experiments using a development set, we found words that often appear in an itemization but do not have common semantic properties with the other items in the same itemization. &quot;(links)&quot; and &quot;(help)&quot; are examples of such words. We prepared a list of 70 such words, and removed them from the HCSs obtained in Step 1.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Step 2: Selection of a hypernym candidate by df and idf </SectionTitle> <Paragraph position="0"> In Step 1, we can obtain a set of hyponym candidates, an HCS, that may have a common hypernym. In Step 2, we select a common hypernym candidate for an HCS. First, we prepare two sets of documents. We randomly select a large number of HTML documents and download them. We call this set of documents a global document set. We assume this document set indicates the general tendencies of word frequencies. Then we download the documents including each hyponym candidate in a given HCS. 
This document set is called a local document set, and we use it to measure how strongly nouns are associated with the hyponym candidates.</Paragraph> <Paragraph position="1"> Let us denote a given HCS as C, a local document set obtained from all the items in C as LD(C), and a global document set as G. We also assume that N is a set of words that can be candidates for hypernyms3.</Paragraph> <Paragraph position="2"> A hypernym candidate, denoted as h(C), for C is obtained through the following formula, where df(n,D) is the number of documents that include a noun n in a document set D.</Paragraph> <Paragraph position="3"> h(C) = argmax_{n in N} df(n,LD(C)) * idf(n,G)</Paragraph> <Paragraph position="4"> The score has a large value for a noun that appears in a large number of documents in the local document set and is found in a relatively small number of documents in the global document set.</Paragraph> <Paragraph position="5"> In general, nouns strongly associated with many items in a given HCS tend to be selected through the above formula. Since hyponym candidates tend to share a common semantic property, and their hypernym is one of the words strongly associated with the common property, the hypernym is likely to be picked up through the above formula. Note that a process of generalization is performed automatically by treating all the hyponym candidates in an HCS simultaneously. That is, words strongly connected with only one hyponym candidate (for instance, &quot;Lexus&quot; for Toyota) have relatively low score values since we obtain statistical measures from all the local document sets for all the hyponym candidates in an HCS.</Paragraph> <Paragraph position="6"> Nevertheless, this scoring method is a weak method in one sense. There could be many non-hypernyms that are 3In our experiments, N is a set consisting of 37,639 words, each of which appeared more than 500 times in 33 years of Japanese newspaper articles (Yomiuri newspaper 1987-2001, Mainichi newspaper 1991-1999 and Nikkei newspaper 1983-1990; 3.01 GB in total). 
We excluded from N 116 nouns that we observed never act as hypernyms; we found them in the experiments using a development set.</Paragraph> <Paragraph position="7"> strongly associated with many of the hyponym candidates (for instance, &quot;price&quot; for Toyota and Honda). Such non-hypernyms are dealt with in the next step.</Paragraph> <Paragraph position="8"> An evident alternative to this method is to use tf(n,LD(C)), which is the frequency of a noun n in the local document set, instead of df(n,LD(C)). We tried using this method in our experiments, but it produced less accurate results, as we show in Section 3.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Step 3: Ranking of hypernym candidates and HCSs by semantic similarity </SectionTitle> <Paragraph position="0"> Thus, our procedure can produce pairs consisting of a hypernym candidate and an HCS, which are denoted by {<h(C1),C1> ,<h(C2),C2> ,***,<h(Cm),Cm> }.</Paragraph> <Paragraph position="1"> Here, C1,***,Cm are HCSs, and h(Ci) is a common hypernym candidate for hyponym candidates in an HCS Ci. In Step 3, our procedure ranks these pairs by using the semantic similarity between h(Ci) and the items in Ci. The final output of our procedure is the top k pairs in this ranking after some heuristic rules are applied to it in Step 4. In other words, the procedure discards the remaining m - k pairs in the ranking because they tend to include erroneous hypernyms.</Paragraph> <Paragraph position="2"> As mentioned, we cannot exclude non-hypernyms that are strongly associated with hyponym candidates from the hypernym candidate obtained by h(C). For example, the value of h(C) may be a non-hypernym &quot;price&quot;, rather than &quot;company&quot;, when C = {Toyota,Honda}. The objective of Step 3 is to exclude such non-hypernyms from the output. 
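For concreteness, the df * idf selection used in Step 2 can be sketched as follows; a toy illustration in which each document is a set of words, and the smoothed logarithmic idf form is our assumption, not taken from the paper:

```python
# Sketch of Step 2: pick the hypernym candidate h(C) maximizing
# df(n, LD(C)) * idf(n, G).  Document sets are word sets here,
# purely for illustration; real sets come from downloaded pages.
import math

def df(noun, docs):
    """Number of documents in `docs` containing `noun`."""
    return sum(1 for d in docs if noun in d)

def select_hypernym(candidates, local_docs, global_docs):
    # idf smoothed with +1 in the denominator (our choice).
    def score(n):
        return df(n, local_docs) * math.log(len(global_docs) / (1 + df(n, global_docs)))
    return max(candidates, key=score)

local = [{"Toyota", "company", "price"},
         {"Honda", "company", "price"},
         {"Nissan", "company"}]
globl = [{"price"}, {"price"}, {"price"}, {"company"}]

print(select_hypernym({"company", "price"}, local, globl))  # → company
```

Here &quot;price&quot; appears in many local documents but also in many global ones, so its idf is low and &quot;company&quot; wins, matching the intuition described above.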
We expect such non-hypernyms to have relatively low semantic similarities to the hyponym candidates, while true hypernyms should behave in a way semantically similar to the hyponyms.</Paragraph> <Paragraph position="3"> If we rank the pairs of hypernym candidates and HCSs according to their semantic similarities, the low-ranked pairs are likely to have an erroneous hypernym candidate.</Paragraph> <Paragraph position="4"> We can then obtain relatively precise hypernyms by discarding the low-ranked pairs.</Paragraph> <Paragraph position="5"> The similarities are computed through the following steps. First, we parse all the texts in the local document set, and check the argument positions of verbs where hyponym candidates appear. (To parse texts, we use a downgraded version of an existing parser (Kanayama et al., 2000) throughout this work.) Let us denote the frequency of the hyponym candidates in an HCS C occupying an argument position p of a verb v as fhypo(C,p,v). Assume that all possible argument positions are denoted as {p1,***,pl} and all the verbs as {v1,***,vm}. 
We then define the co-occurrence vector of hyponym candidates as follows.</Paragraph> <Paragraph position="6"> hypov(C) = <fhypo(C,p1,v1),***,fhypo(C,pl,vm)> </Paragraph> <Paragraph position="7"> In the same way, we can define the co-occurrence vector of a hypernym candidate n.</Paragraph> <Paragraph position="8"> hyperv(n) = <f(n,p1,v1),***,f(n,pl,vm)> Here, f(n,p,v) is the frequency of a noun n occupying an argument position p of a verb v obtained from the parsing results of a large number of documents - 33 years of Japanese newspaper articles (Yomiuri newspaper 1987-2001, Mainichi newspaper 1991-1999, and Nikkei newspaper 1990-1998; 3.01 GB in total) - in our experimental setting.</Paragraph> <Paragraph position="9"> The semantic similarities between hyponym candidates in C and a hypernym candidate n are then computed by a cosine measure between the vectors:</Paragraph> <Paragraph position="10"> sim(n,C) = (hypov(C) . hyperv(n)) / (|hypov(C)| |hyperv(n)|)</Paragraph> <Paragraph position="11"> Our procedure sorts the hypernym-HCS pairs {<h(Ci),Ci> } (i = 1,***,m) using the value sim(h(Ci),Ci)*df(h(Ci),LD(Ci))*idf(h(Ci),G) Note that we consider not only the similarity but also the df * idf score used in Step 2 in the sorting.</Paragraph> <Paragraph position="12"> An evident alternative to the above method is the algorithm that re-ranks the top j hypernym candidates obtained by df * idf for a given HCS by using the same score. However, we found no significant improvement when this alternative was used in our experiments, as we later explain.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Step 4: Application of other heuristic rules </SectionTitle> <Paragraph position="0"> The procedure described up to now can produce a hypernym for hyponym candidates with a certain precision. 
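The cosine measure of Step 3 can be sketched as follows; the sparse vectors are indexed by (argument position, verb) pairs as in the text, and all counts are invented for illustration:

```python
# Sketch of Step 3: cosine similarity between the co-occurrence
# vector of an HCS and that of a hypernym candidate.  A true
# hypernym ("company") shares verb contexts with the hyponyms,
# while an associated non-hypernym ("price") does not.
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts keyed by (p, v))."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# f(n, p, v): frequency of a noun at argument position p of verb v.
hypov   = {("subj", "manufacture"): 10, ("obj", "buy"): 4}   # HCS {Toyota, Honda}
company = {("subj", "manufacture"): 8,  ("obj", "buy"): 3}
price   = {("subj", "rise"): 9,         ("obj", "pay"): 5}

print(cosine(hypov, company) > cosine(hypov, price))  # → True
```

In the full procedure this similarity is then multiplied by the df * idf score of Step 2 to sort the hypernym-HCS pairs.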
We found, though, that we can improve accuracy by using a few more heuristic rules, which are listed below.</Paragraph> <Paragraph position="1"> Rule 1 If the number of documents that include a hypernym candidate is less than the sum of the numbers of the documents that include an item in the HCS, then discard both the hypernym candidate and the HCS from the output.</Paragraph> <Paragraph position="2"> Rule 2 If a hypernym candidate appears as a substring of an item in its HCS and it is not a suffix of the item, then discard both the hypernym candidate and the HCS from the output. If a hypernym candidate is a suffix of its hyponym candidate, then at least half of the members of the HCS must have the hypernym candidate as their suffix. Otherwise, discard both the hypernym candidate and its HCS from the output.</Paragraph> <Paragraph position="3"> Rule 3 If a hypernym candidate is an expression belonging to the category of place names, then replace it by &quot;place name&quot;.</Paragraph> <Paragraph position="4"> In general, we can expect that a hypernym is used in a wider range of contexts than those of its hyponyms, and that the number of documents including the hypernym candidate should be larger than the number of web documents including hyponym candidates. This justifies Rule 1. We use the hit counts given by an existing search engine as the number of documents including an expression. As for Rule 2, note that Japanese is a head-final language, and the semantic head of a complex noun phrase is its last noun. Consider the two Japanese complex nouns amerika eiga (American movie) and nihon eiga (Japanese movie).</Paragraph> <Paragraph position="6"> Apparently, an American movie is a kind of movie, as is a Japanese movie. There are many multi-word expressions whose hypernyms are their suffixes, and if some expressions share a common suffix, it is likely to be their hypernym. 
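The suffix conditions of Rule 2 can be sketched as follows; English compounds stand in for the Japanese head-final examples, and the one-half threshold follows the text:

```python
# Sketch of Step 4, Rule 2: a hypernym candidate appearing inside a
# hyponym candidate must be its suffix, and must be a suffix of at
# least half of the HCS members.
def passes_rule2(hypernym, hcs):
    inside = [h for h in hcs if hypernym in h]
    if any(not h.endswith(hypernym) for h in inside):
        return False  # appears in a position other than as a suffix
    if inside and sum(h.endswith(hypernym) for h in hcs) < len(hcs) / 2:
        return False  # common suffix of too small a portion of the HCS
    return True

print(passes_rule2("movie", ["American movie", "Japanese movie"]))  # → True
print(passes_rule2("movie", ["movie star", "Japanese movie"]))      # → False
```

In the second call, &quot;movie&quot; appears as a prefix of &quot;movie star&quot;, so the pair would be discarded, as the rule prescribes.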
However, if a hypernym candidate appears in a position other than as a suffix of a hyponym candidate, the hypernym candidate is likely to be an erroneous one.</Paragraph> <Paragraph position="7"> In addition, if a hypernym candidate is a common suffix of only a small portion of an HCS, then the HCS tends not to have semantic uniformity, and such a hypernym candidate should be eliminated from the output. (We empirically determined &quot;one-half&quot; as a threshold in our experiments on the development set.) As for Rule 3, in our experiments on a development set, we found that our procedure could not provide precise hypernyms for place names such as &quot;Kyoto&quot; and &quot;Tokyo&quot;. In the case of Kyoto and Tokyo, our procedure produced &quot;Japan&quot; as a hypernym candidate. Although &quot;Japan&quot; is consistent with most of our assumptions regarding hypernyms, it is a holonym of Kyoto and Tokyo, not their hypernym. In general, when a set of place names is given as an HCS, the procedure tends to produce the name of the region or area that includes all the places designated by the hyponym candidates. We therefore added the rule to replace such place names by the expression &quot;place name&quot;, which is a true hypernym in many such cases.</Paragraph> <Paragraph position="8"> Recall that we obtained the ranked pairs of an HCS and its common hypernym in Step 3. By applying the above rules, some pairs are removed from the ranked pairs, or are modified. For some given integer k, the top k pairs of the obtained ranked pairs become the final output of our procedure, as mentioned before.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"> We downloaded about 8.71 x 10^5 HTML documents (10.4 GB with HTML tags), and extracted 9.02 x 10^4 HCSs from them. We randomly picked 2,000 HCSs from among the extracted HCSs as our test set. 
The test set contained 13,790 hyponym candidates. (Besides these HCSs, we used a development set consisting of about 4,000 HCSs to develop our algorithm.) For each single hyponym candidate, we downloaded the top 100 documents in the ranking produced by a search engine as a local document set if the engine found more than 100 documents. Otherwise, all the documents were downloaded. (Note that a local document set for an HCS may contain more than 100 documents.) As a global document set, we used the downloaded 1.00 x 10^6 HTML documents (1.26 GB without HTML tags).</Paragraph> <Paragraph position="1"> Fig. 3 shows the accuracy of hypernyms obtained after Steps 2, 3, and 4. We assumed each step produced the sorted pairs of an HCS and a hypernym, which are denoted by {<h(C1),C1> ,<h(C2),C2> ,***,<h(Cm),Cm> }. The sorting was done by the score sim(h(Ci),Ci) * df(h(Ci),LD(Ci)) * idf(h(Ci),G) after Steps 3 and 4, as described before, while the output of Step 2 was sorted by the df * idf score. In addition, we assumed that each step produced only the top 200 pairs. (Since the output of Step 4 is the final output, this means that we also assumed that only the top 200 pairs of a hypernym and an HCS would be produced as final output with our procedure. In other words, the remaining 1,800 (=2,000-200) pairs were discarded.) The resulting hypernyms were checked by the authors according to the definition of the hypernym given in (Miller et al., 1990); i.e., we checked if the expression &quot;a hyponym candidate is a kind of a hypernym candidate&quot; is acceptable. Then, we computed the precision, which is the ratio of the correct hypernym-hyponym pairs to all the pairs obtained from the top n pairs of an HCS and its hypernym candidate. 
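The cumulative precision over the top-ranked pairs can be sketched as follows; the HCS sizes and correct counts below are invented for illustration:

```python
# Sketch of the evaluation: for sorted (HCS, hypernym) pairs, the
# point at rank j is (cumulative hyponym count, cumulative precision).
sizes   = [5, 4, 6]    # |C_1|, |C_2|, |C_3|
correct = [5, 3, 3]    # correct(C_k, h(C_k)): true hyponyms of h(C_k) in C_k

points, n, c = [], 0, 0
for size, good in zip(sizes, correct):
    n += size
    c += good
    points.append((n, round(c / n, 3)))

print(points)  # → [(5, 1.0), (9, 0.889), (15, 0.733)]
```

Each point corresponds to keeping only the top j pairs: precision degrades as lower-ranked, more error-prone pairs are included.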
The x-axis of the graph indicates the number of hypernym-hyponym pairs obtained from the top n pairs of an HCS and its hypernym candidate, while the y-axis indicates the precision.</Paragraph> <Paragraph position="2"> More precisely, the curve for Step i plots the following points, where 1 <= j <= 200.</Paragraph> <Paragraph position="3"> ( |C1| + *** + |Cj| , (correct(C1,h(C1)) + *** + correct(Cj,h(Cj))) / (|C1| + *** + |Cj|) )</Paragraph> <Paragraph position="4"> correct(Ck,h(Ck)) indicates the number of hyponym candidates in Ck that are true hyponyms of h(Ck). Note that after Step 4, the precision reached about 75% for 701 hyponym candidates, which was slightly more than 5% of all the given hyponym candidates. For 1,398 hyponym candidates (about 10% of all the candidates), the precision was about 61%.</Paragraph> <Paragraph position="5"> Another important point is that &quot;Step 2 (tf)&quot; in the graph refers to an alternative to our Step 2 procedure; i.e., the Step 2 procedure in which df(h(C),LD(C)) was replaced by tf(h(C),LD(C)). One can see that the Step 2 procedure with df works better than the one with tf.</Paragraph> <Paragraph position="6"> Table 1 shows some examples of the acquired HCSs and their common hypernyms. Recall that a common suffix of an HCS is a good candidate to be a hypernym. The examples were taken from cases where a common suffix of an HCS was not produced as a hypernym. This list is actually the output of Step 3, and shows which HCSs and their hypernym candidates were eliminated/modified from the output in Step 4 and which rule was fired to eliminate/modify them.</Paragraph> <Paragraph position="7"> Next, we eliminated some steps from the whole procedure. Figure 4 shows the accuracy when one of the steps was eliminated from the procedure. 
&quot;-Step X&quot; or &quot;-Rule X&quot; refers to the accuracies obtained through the procedure from which Step X or Rule X was eliminated. Note that both graphs indicate that every step and rule contributed to the improvement of the precision.</Paragraph> <Paragraph position="8"> Figure 5 compares our method and an alternative method, which was the algorithm that re-ranks the top j hypernym candidates for a given HCS by using the score sim(h,C) * df(h,LD(C)) * idf(h,G), where h is a hypernym candidate, in Step 3. (Recall that our algorithm uses the score only for sorting pairs of HCSs and their hypernym. In other words, we do not re-rank the hypernym candidates for a single HCS.) We found no significant improvement when the alternative was used.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Comparison with alternative methods </SectionTitle> <Paragraph position="0"> We have shown that our assumptions are effective for acquiring hypernyms. However, there are other alternative methods applicable under our settings. We evaluated the following methods and compared the results with those of our procedure.</Paragraph> <Paragraph position="1"> Alternative 1 Compute the non-null suffixes that are shared by the maximum number of hyponym candidates, and regard the longest as a hypernym candidate. Alternative 2 Extract hypernyms for hyponym candidates by looking at the captions or titles of the itemizations from which hyponym candidates are extracted. Alternative 3 Extract hypernyms by using lexicosyntactic patterns.</Paragraph> <Paragraph position="2"> Alternative 4 Combinations of Alternatives 1-3.</Paragraph> <Paragraph position="3"> The evaluation method for Alternative 1 and Alternative 2 is the same as the one for our method. We simply judged whether the produced hypernyms were acceptable. But we used a different evaluation method for the other alternatives. 
We checked whether the correct hypernyms produced by our method could also be found by these alternatives. This was simply for ease of evaluation. Note that we evaluated Alternative 1 and Alternative 2 in the second evaluation scheme when they were combined and used as a part of Alternative 4.</Paragraph> <Paragraph position="4"> More detailed explanations of the alternative methods are given below.</Paragraph> <Paragraph position="5"> Alternative 1 Recall that Japanese is a head-final language, and we have explained that common suffixes of hyponym candidates are good candidates to be common hypernyms. Alternative 1 computes a hypernym candidate according to this principle.</Paragraph> <Paragraph position="6"> Alternative 2 This method uses the captions of the itemizations, which are likely to contain a hypernym of the items in the itemization. We manually found captions or titles positioned so that they could explain the content of the itemization, and picked up the caption closest to the itemization and the second closest to it. Then, we checked if the picked-up captions included the proper hypernyms. Note that the precision obtained by this method is just an upper bound of real performance, because we do not have a method to extract hypernyms from captions, at least at the current stage of our research. Alternative 3 We prepared the lexicosyntactic patterns in Fig. 6, which are similar to the ones used in the previous studies of hypernym acquisition in Japanese (Imasumi, 2001; Ando et al., 2003). One difference from the previous studies was that we used regular expressions instead of a parser. This may have caused some errors, but our patterns were more generous than those used in the previous studies, and did not miss the expressions matched by the patterns from the previous studies. In other words, the accuracy obtained with our patterns was an upper bound on the performance obtained by the previous proposal. 
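Pattern-based extraction of this kind can be sketched as follows; an English "X such as Y" pattern is used purely as a stand-in, since the Japanese patterns of Fig. 6 are specific to the original study:

```python
# Sketch of Alternative 3: matching hypernym-hyponym pairs with a
# lexicosyntactic pattern implemented as a regular expression.
import re

PATTERN = re.compile(r"(\w+) such as (\w+)")

def find_pairs(text):
    """Return (hypernym, hyponym) pairs matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

print(find_pairs("makers such as Toyota compete on price"))  # → [('makers', 'Toyota')]
```

A pattern matcher of this sort only finds pairs that happen to co-occur in a hypernym-bearing construction, which is why the text treats its accuracy as an upper bound rather than a full competitor.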
Another difference was that the procedure was given correct pairs of a hypernym and a hyponym computed beforehand using our proposed method, and it only checked whether the given pairs could be found in the given texts by using the lexicosyntactic patterns. In other words, this alternative method checked if the lexicosyntactic patterns could find the hypernym-hyponym pairs successfully obtained by our procedure. The texts used were the local document sets from which our procedure computed a hypernym candidate. If our procedure has better figures than this method, this means that our procedure can produce hypernyms that cannot be acquired by patterns, at least from a rather small number of texts (i.e., a maximum of 100 documents per hyponym candidate).</Paragraph> <Paragraph position="7"> Alternative 4 We also compared our procedure with the combination of all the above methods: Alternative 4. Again, we checked whether the combination could find the correct hypernym-hyponym pairs provided by our method. The difference between the precision of our method and that of Alternative 4 reflects the number of hypernym-hyponym pairs that our method could acquire and that Alternative 4 could not. We assumed that for a given HCS a hypernym was successfully acquired if one of the above methods could find the correct hypernym. In other words, the performance of Alternative 4 would be achieved only if there were a technique to combine the output of the above methods in an optimal way.</Paragraph> <Paragraph position="8"> We then compared the precision of our method and the alternative methods. We plotted the graph assuming the pairs of hypernym candidates and hyponym candidates were sorted in the same order as the order obtained by our procedure. The results suggest that our method can acquire a significant number of hypernyms that the alternative methods cannot obtain when a rather small amount of text is given (a maximum of 100 documents per hyponym candidate), as in our current experimental settings. 
The difference, particularly the difference from the performance of Alternative 3, may become smaller when more texts are given to the alternative methods. However, the comparison in such settings is a difficult task because of the time required for downloading; it remains as possible future work.</Paragraph> </Section> </Paper>