<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1019">
  <Title>SUBJECT-DEPENDENT CO-OCCURRENCE AND WORD SENSE DISAMBIGUATION</Title>
  <Section position="4" start_page="0" end_page="146" type="metho">
    <SectionTitle>
CO-OCCURRENCE NEIGHBORHOODS
</SectionTitle>
    <Paragraph position="0"> Words which occur frequently with a given word may be thought of as forming a &quot;neighborhood&quot; of that word. If we can determine which words (i.e. spelling forms) co-occur frequently with each word sense, we can use these neighborhoods to disambiguate the word in a given text.</Paragraph>
    <Paragraph position="1"> Assume that we know of only two of the classic senses of the word bank: 1) a repository for money, and 2) a pile of earth on the edge of a river. We can expect the &quot;money&quot; sense of bank to co-occur frequently with such words as &quot;money&quot;, &quot;loan&quot;, and &quot;robber&quot;, while the &quot;river&quot; sense would be more frequently associated with &quot;river&quot;, &quot;bridge&quot;, and &quot;earth&quot;. In order to disambiguate &quot;bank&quot; in a text, we would produce neighborhoods for each sense and intersect them with the text, our assumption being that the neighborhood which shares more words with the text determines the correct sense. Variations of this idea appear in (Lesk, 1986; McDonald et al., 1990; Wilks, 1987; 1990; Veronis and Ide, 1990).</Paragraph>
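The intersection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two neighborhoods are hand-written from the words named above, and the function name `pick_sense` and its handling of ties (it returns None) are hypothetical choices, not part of the method as described.

```python
def pick_sense(neighborhoods, text_words):
    """Return the sense whose neighborhood shares the most words with the text.

    neighborhoods: dict mapping a sense label to a set of words.
    text_words: iterable of lower-cased words from the text.
    Returns None when nothing overlaps or when the best score is tied.
    """
    text = set(text_words)
    scores = {sense: len(words.intersection(text))
              for sense, words in neighborhoods.items()}
    best = max(scores.values())
    winners = [sense for sense, n in scores.items() if n == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# Hypothetical neighborhoods built from the words named in the text.
bank_neighborhoods = {
    "money": {"money", "loan", "robber"},
    "river": {"river", "bridge", "earth"},
}
sentence = "we got a bank loan to buy a car".split()
print(pick_sense(bank_neighborhoods, sentence))
```

With these toy neighborhoods only "loan" is shared with the sentence, so the "money" sense is the one returned.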
    <Paragraph position="2"> Previously, McDonald and Plate (McDonald et al., 1990; Schvaneveldt, 1990) used the LDOCE definitions as their text, in order to generate co-occurrence data for the 2,187 words in the LDOCE control (defining) vocabulary. They used various methods to apply this data to the problem of disambiguating control vocabulary words as they appear in the LDOCE example sentences. In every case, however, the neighborhood of a given word was a co-occurrence neighborhood for its spelling form over all the definitions in the dictionary. Distinct neighborhoods corresponding to distinct senses had to be obtained by using the words in the sense definition as a core for the neighborhood, and expanding it by combining it with additional words from the co-occurrence neighborhoods of the core words.</Paragraph>
  </Section>
  <Section position="5" start_page="146" end_page="148" type="metho">
    <SectionTitle>
SUBJECT-DEPENDENT NEIGHBORHOODS
</SectionTitle>
    <Paragraph position="0"> The study of word co-occurrence in a text is based on the cliche that &quot;one (a word) is known by the company one keeps&quot;. We hold that it also makes a difference where that company is kept: since a word may occur with different sets of words in different contexts, we construct word neighborhoods which depend on the subject of the text in question. We call these, naturally enough, &quot;subject-dependent neighborhoods&quot;. A unique feature of the electronic version of LDOCE is that many of the word sense definitions are marked with a subject field code which tells us which subject area the sense pertains to. For example, the &quot;money&quot;-related senses of bank are marked EC (Economics). For each such main subject heading, we consider the subset of LDOCE definitions that consists of those sense definitions which share that subject code. These definitions are then collected into one file, and co-occurrence data for their defining vocabulary is generated. Word x is said to co-occur with word y if x and y appear in the same sense definition; the total number of times they co-occur is denoted as f_xy. We then construct a 2,187 x 2,187 matrix in which each row and column corresponds to one word of the defining vocabulary, and the entry in the xth row and yth column represents the number of times the xth word co-occurred with the yth word.</Paragraph>
    <Paragraph position="1"> (This is a symmetric matrix, and therefore it is only necessary to maintain half of it.) We denote by f_x the total number of times word x appeared. While many statistics may be used to measure the relatedness of words x and y, we used the function</Paragraph>
    <Paragraph position="3"> in this study. We choose a co-occurrence neighborhood of a word x from a set of closely related words. We may choose the ten words with the highest relatedness statistic, for instance.</Paragraph>
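The counting and ranking steps can be sketched as follows. Several things here are assumptions rather than the authors' procedure: the relatedness function is not reproduced in this text, so the sketch substitutes a Jaccard-style ratio, f_xy / (f_x + f_y - f_xy); each word and each pair is counted at most once per definition; and the function names and the toy definitions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(definitions):
    """Count word and pair frequencies over a list of sense definitions.

    Each definition is a list of words. A pair co-occurs once for every
    definition containing both words. Only one ordering of each pair is
    stored, which mirrors keeping half of the symmetric matrix.
    """
    f_word = Counter()
    f_pair = Counter()
    for definition in definitions:
        vocab = sorted(set(definition))
        f_word.update(vocab)
        f_pair.update(combinations(vocab, 2))
    return f_word, f_pair


def relatedness(x, y, f_word, f_pair):
    """ASSUMED statistic: f_xy / (f_x + f_y - f_xy), a Jaccard-style ratio."""
    f_xy = f_pair[tuple(sorted((x, y)))]
    denom = f_word[x] + f_word[y] - f_xy
    return f_xy / denom if denom else 0.0


def neighborhood(x, f_word, f_pair, size=10):
    """Return the `size` words most related to x, highest score first."""
    scored = [(relatedness(x, y, f_word, f_pair), y)
              for y in f_word if y != x]
    scored = [(score, y) for score, y in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [y for _, y in scored[:size]]


# Toy definitions sharing one subject code (hypothetical data).
toy_definitions = [
    ["place", "money", "keep", "pay"],
    ["put", "money", "bank"],
    ["keep", "money", "bank"],
]
f_word, f_pair = cooccurrence_counts(toy_definitions)
print(neighborhood("bank", f_word, f_pair, size=3))
```

Any other association statistic could be dropped into `relatedness` without changing the rest of the sketch; ties are broken alphabetically only to keep the output deterministic.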
    <Paragraph position="4"> Neighborhoods of the word &quot;metal&quot; in the categories &quot;Economics&quot; and &quot;Business&quot; are presented below: In this example, the neighborhoods reflect a fundamental difference between the two subject areas. Economics is a more theoretical subject, and therefore its neighborhood contains words like &quot;idea&quot;, &quot;gold&quot;, &quot;silver&quot;, and &quot;real&quot;, while in the more practical domain of Business, we find the words &quot;brass&quot;, &quot;apparatus&quot;, &quot;spring&quot;, and &quot;plate&quot;. We can expect the contrast between subject neighborhoods to be especially great for words with senses that fall into different subject areas. Consider the actual neighborhoods of our original example, bank.</Paragraph>
    <Paragraph position="5"> Table 3. Economics neighborhood of bank
      Subject Code EC = Economics
      bank: account, cheque, money, by, into, have, keep, order, out, pay, at, put, from, draw, an, busy, more, supply, it, safe
      Table 4. Engineering neighborhood of bank
      Subject Code EG = Engineering
      bank: river, wall, flood, thick, earth, prevent, opposite, chair, hurry, paste, spread, overflow, walk, help, we, throw, clay, then, wide, level
      Notice that even though we included the twenty most closely related words in each neighborhood, the two neighborhoods are still disjoint, although many of the words which appear in the lists are indeed suggestive of the sense or senses which fall under that subject category. In LDOCE, three of the eleven senses of bank are marked with the code EC for Economics, and these represent the &quot;money&quot; senses of the word. It is a quirk of the classification in LDOCE that the &quot;river&quot; senses of bank are not marked with a subject code.</Paragraph>
    <Paragraph position="6"> This lack of a subject code for a word sense in LDOCE is not uncommon, however, and as was the case with bank, some senses of a word may have subject codes while others do not. We label this lack of a subject code the &quot;null code&quot;, and form a neighborhood for this type of sense by using all sense definitions without a code as the text. This &quot;null code neighborhood&quot; can reveal the common, or &quot;generic&quot;, sense of the word. The twenty words occurring most frequently with bank in definitions with the null subject code form the following neighborhood:
      Table 5. Null Code neighborhood of bank
      Subject Code NULL = no code assigned
      bank: rob, river, account, lend, overflow, flood, money, criminal, lake, flow, snow, cliff, police, shore, heap, thief, borrow, along, steep, earth
      It is obvious that approximately half of these words are associated with our two main senses of bank, but a new element has crept in: four of the eight words which refer to the money sense (&quot;rob&quot;, &quot;criminal&quot;, &quot;police&quot;, and &quot;thief&quot;) reveal a sense of bank which did not appear in the EC neighborhood. In the null code definitions, there are quite a few references to the potential for a bank to be robbed. Finally, for comparison, consider a neighborhood for bank which uses all the LDOCE definitions (see McDonald et al., 1990; Schvaneveldt, 1990; Wilks et al., 1990):</Paragraph>
    <Section position="1" start_page="147" end_page="148" type="sub_section">
      <SectionTitle>
Subject Code All
</SectionTitle>
      <Paragraph position="0"> bank: account, bank, busy, cheque, criminal, earn, flood, flow, interest, lake, lend, money, overflow, pay, river, rob, safe, sand, thief, wall
        Only four of these words (&quot;bank&quot;, &quot;earn&quot;, &quot;interest&quot;, and &quot;sand&quot;) are not found in the other three neighborhoods, and the numbers of words in the intersection of this neighborhood with the Economics, Engineering, and Null neighborhoods are six, four, and eleven, respectively. Recalling that the Economics and Engineering neighborhoods are disjoint, this data supports our hypothesis that the subject-dependent neighborhoods help us to distinguish senses more easily than neighborhoods which are extracted from the whole dictionary.</Paragraph>
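The overlap figures quoted for the four neighborhoods of bank can be recomputed directly from the word lists. The sketch below copies the lists from the tables; the variable names are hypothetical, and treating "safe" and "sand" as two separate entries of the whole-dictionary list is an assumption about the intended reading of that list.

```python
ec = set("account cheque money by into have keep order out pay "
         "at put from draw an busy more supply it safe".split())
eg = set("river wall flood thick earth prevent opposite chair hurry paste "
         "spread overflow walk help we throw clay then wide level".split())
null_code = set("rob river account lend overflow flood money criminal lake "
                "flow snow cliff police shore heap thief borrow along steep "
                "earth".split())
# Whole-dictionary neighborhood; "safe" and "sand" are read as two words.
all_defs = set("account bank busy cheque criminal earn flood flow interest "
               "lake lend money overflow pay river rob safe sand thief "
               "wall".split())

# Are the Economics and Engineering neighborhoods disjoint?
print(ec.isdisjoint(eg))
# Size of the overlap with each subject-dependent neighborhood.
print([len(all_defs.intersection(s)) for s in (ec, eg, null_code)])
# Words of the whole-dictionary list found in none of the other three.
print(sorted(all_defs - ec.union(eg, null_code)))
```

Under these readings the overlaps come out as six, four, and eleven, and four words of the whole-dictionary list appear in none of the subject-dependent lists.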
      <Paragraph position="1"> There are over a hundred main subject field codes in LDOCE, and over three hundred sub-divisions within these. For example, &quot;medicine-and-biology&quot; is a main subject field (coded &quot;MD&quot;), and has twenty-two sub-divisions such as &quot;anatomy&quot; and &quot;biochemistry&quot;. These main codes and their sub-divisions constitute the only two levels in the LDOCE subject code hierarchy, and main codes such as &quot;golf&quot; and &quot;sports&quot; are not related to each other. Currently, we use only the main codes when we are constructing a subject-dependent neighborhood. But even this division of the definition text is fine enough that, given a word and a subject code, the word may not appear at all in the definitions which have that subject code.</Paragraph>
      <Paragraph position="2"> To overcome this problem, we have adopted a restructured hierarchy of the subject codes, as developed by Slator (1988). This tree structure has a node at the top, representing all the definitions. At the next level are six fundamental categories such as &quot;science&quot; and &quot;transportation&quot;, as well as the null code. These clusters are further sub-divided so that some main codes become sub-divisions of others (&quot;golf&quot; becomes a sub-division of &quot;sports&quot;, etc.). The maximum depth of this tree is five levels.</Paragraph>
      <Paragraph position="3"> If the word for which we want to produce a neighborhood appears too infrequently in definitions with a given code, we travel up the hierarchy and expand the text under consideration until we have reached a point where the word appears frequently enough to allow the neighborhood to be constructed. The worst case scenario would be one in which we had traveled all the way to the top of the hierarchy and used all the definitions as the text, only to wind up with the same co-occurrence neighborhoods as did McDonald and Plate (Schvaneveldt, 1990; Wilks et al., 1990)! There are certain drawbacks in using LDOCE to construct the subject-dependent neighborhoods, however: the amount of text in LDOCE about any one subject area is rather limited, it is composed only of dictionary definitions written in a control vocabulary, and it uses sample sentences which were concocted with non-native English speakers in mind.</Paragraph>
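The back-off up the subject code hierarchy can be sketched as a simple loop. This is a minimal illustration under assumptions: the hierarchy is given as a child-to-parent dictionary, the definitions for each code are assumed to be pre-collected, and the cut-off `min_count` is hypothetical, since the text does not say how often a word must appear to count as frequent enough.

```python
def text_for_word(word, code, parent, definitions_by_code, min_count=5):
    """Climb from `code` toward the root until `word` is frequent enough.

    parent: dict mapping each code to its parent (the root maps to None).
    definitions_by_code: dict mapping a code to the definitions (lists of
        words) collected for that code and everything beneath it.
    min_count: HYPOTHETICAL frequency cut-off; the text gives no value.
    Returns (code_used, definitions).
    """
    while True:
        definitions = definitions_by_code.get(code, [])
        count = sum(1 for d in definitions if word in d)
        if count >= min_count or parent.get(code) is None:
            return code, definitions
        code = parent[code]


# Toy three-level hierarchy (hypothetical codes and definitions).
parent = {"GOLF": "SPORTS", "SPORTS": "ALL", "ALL": None}
definitions_by_code = {
    "GOLF": [["club", "ball"]],
    "SPORTS": [["club", "ball"], ["bank", "shot"]],
    "ALL": [["club", "ball"], ["bank", "shot"], ["bank", "money"]],
}
code_used, _ = text_for_word("bank", "GOLF", parent, definitions_by_code,
                             min_count=2)
print(code_used)
```

Reaching the root corresponds to the worst case described above, in which all the definitions are used as the text.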
      <Paragraph position="4"> In the next phase of our research, large corpora consisting of actual documents from a given subject area will be used, in order to obtain neighborhoods which more accurately reflect the sorts of texts which will be used in applications. In the future, these neighborhoods may replace those constructed from LDOCE, while leaving the subject code hierarchy and various applications intact.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="148" end_page="149" type="metho">
    <SectionTitle>
WORD SENSE DISAMBIGUATION
</SectionTitle>
    <Paragraph position="0"> In this section, we describe an application of subject-dependent co-occurrence neighborhoods to the problem of word sense disambiguation. The subject-dependent co-occurrence neighborhoods are used as building blocks for the neighborhoods used in disambiguation. For each of the subject codes (including the null code) which appear with a word sense to be disambiguated, we intersect the corresponding subject-dependent co-occurrence neighborhood with the text being considered (the size of text can vary from a sentence to a paragraph).</Paragraph>
    <Paragraph position="1"> The intersection must contain a pre-selected minimum number of words to be considered.</Paragraph>
    <Paragraph position="2"> But if none of the neighborhoods intersect at greater than this threshold level, we replace the neighborhood N by the neighborhood N(1), which consists of N together with the first word from each neighborhood of words in N, using the same subject code. If necessary, we add the second most strongly associated word for each of the words in the original neighborhood N, forming the neighborhood N(2). We continue this process until a subject-dependent co-occurrence neighborhood has an intersection above the threshold level. Then, the sense or senses with this subject code are selected. If more than one sense has the selected code, we use their definitions as cores to build distinguishing neighborhoods for them. These are again intersected with the text to determine the correct sense.</Paragraph>
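The iterative enlargement from N to N(1), N(2), and so on can be sketched as below. This is a sketch under assumptions, not the authors' implementation: the function names, the safety cap `max_k`, and the toy data are hypothetical, and the threshold test is taken to mean "at least the threshold", which is the reading suggested by the first worked example (two shared words with a threshold of 2).

```python
def expand(core, ranked_neighbors, k):
    """Build N(k): the core plus the top-k neighbors of each core word.

    core: ordered list of words forming the original neighborhood N.
    ranked_neighbors: dict mapping a word to its own neighborhood under
        the same subject code, most strongly associated word first.
    A candidate that is already present is skipped, not replaced.
    """
    expanded = list(core)
    for word in core:
        for candidate in ranked_neighbors.get(word, [])[:k]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def select_code(neighborhoods, ranked, text_words, threshold, max_k=50):
    """Return (codes, k) for the first k at which some code's N(k) shares
    at least `threshold` words with the text, or (None, max_k).

    neighborhoods: dict code -> list of words (the original N per code).
    ranked: dict code -> ranked_neighbors dict for that code.
    max_k: HYPOTHETICAL safety cap; the text does not mention one.
    """
    text = set(text_words)
    for k in range(max_k + 1):
        hits = {}
        for code, core in neighborhoods.items():
            n_k = expand(core, ranked.get(code, {}), k)
            hits[code] = len(text.intersection(n_k))
        best = max(hits.values())
        if best >= threshold:
            return [c for c, n in hits.items() if n == best], k
    return None, max_k


# Toy data: shortened neighborhoods and invented rankings.
neighborhoods = {"AU": ["make", "up", "car"],
                 "EC": ["money", "pay", "account"]}
ranked = {"AU": {"make": ["engine"], "up": ["increase"]},
          "EC": {"money": ["get", "loan"], "pay": ["buy"]}}
text = "we get a bank loan to buy a car".split()
print(select_code(neighborhoods, ranked, text, threshold=2))
```

If several codes tie at the winning level, all of them are returned, leaving the choice between their senses to the later definition-based step.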
    <Paragraph position="3"> The following two examples illustrate this method. Note that some of the neighborhoods differ from those given earlier since the text used to construct these neighborhoods includes any example sentences which may occur in the sense definitions.</Paragraph>
    <Paragraph position="4"> Those neighborhoods presented earlier ignored the example sentences. In each example, we attempt to disambiguate the word &quot;bank&quot; in a sentence which appears as an example sentence in the Collins COBUILD English Language Dictionary.</Paragraph>
    <Paragraph position="5"> The disambiguation consists of choosing the correct sense of &quot;bank&quot; from among the thirteen senses given in LDOCE. These senses are summarized below.</Paragraph>
    <Paragraph position="6"> bank(1) : [ ] : land along the side of a river, lake, etc.</Paragraph>
    <Paragraph position="8"> one side higher than the other.</Paragraph>
    <Paragraph position="9"> bank(7) : [ ] : a row, especially of oars in an ancient boat or keys on a typewriter.</Paragraph>
    <Paragraph position="10"> bank(8) : [EC] : a place in which money is kept and paid out on demand.</Paragraph>
    <Paragraph position="11"> bank(9) : [MD] : a place where something is held ready for use, such as blood.</Paragraph>
    <Paragraph position="12"> bank(10) : [GB] : (a person who keeps) a supply of money or pieces for payment in a gambling game.</Paragraph>
    <Paragraph position="13"> bank(11) : [ ] : break the bank is to win all the money in bank(10).</Paragraph>
    <Paragraph position="14"> bank(12) : [EC] : to put or keep (money) in a bank.</Paragraph>
    <Paragraph position="15"> bank(13) : [EC] : to keep one's money in a bank.
      Example 1. The sentence is &quot;The aircraft turned, banking slightly.&quot; The neighborhoods of &quot;bank&quot; for the five relevant subject codes are given below.</Paragraph>
    <Paragraph position="16"> Table 7. Automotive neighborhood of bank
      Subject Code AU = Automotive
      bank: make, go, up, move, so, they, high, also, round, car, side, turn, road, aircraft, slope, bend
      have, it, person, out, into, take, money, put, write, keep, pay, order, another, paper, draw, supply, account, safe, sum, cheque
      game, earth, stone, boat, river, bar, snow, lake, sand, shore, mud, framework, flood, cliff, heap, harbor, ocean, parallel, overflow, clerk
      The AU neighborhood contains two words, &quot;aircraft&quot; and &quot;turn&quot;, which also appear in the sentence. Note that we consider all forms of turn (turned, turning, etc.) to match &quot;turn&quot;. Since none of the other neighborhoods have any words in common with the sentence, and since our threshold value for this short sentence is 2, AU is selected as the subject code. We must now decide between the two senses which have this code.</Paragraph>
    <Paragraph position="17"> At this point we remove the function words from the sense definitions and replace each remaining word by its root form. We obtain the following neighborhoods.</Paragraph>
    <Paragraph position="18"> Table 12. Words in sense 4 of bank
      Definition bank(4): slope, make, bend, road, so, they, safe, car, go, round
      Table 13. Words in sense 6 of bank
      Definition bank(6): car, aircraft, move, side, high, make, turn
      Since bank(4) has no words in common with the sentence, and bank(6) has two (&quot;turn&quot; and &quot;aircraft&quot;), bank(6) is selected. This is indeed the sense of &quot;bank&quot; used in the sentence.</Paragraph>
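Example 1 can be replayed with the word lists given for the AU neighborhood and for the two sense definitions. The sketch below is only illustrative: the `root` function is a deliberately crude suffix stripper standing in for whatever root-form reduction was actually used, and all names are hypothetical.

```python
def root(word):
    """Very crude root form: lower-case, trim punctuation, strip one ending."""
    word = word.lower().strip(".,;:!?'\"")
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word


def overlap(words, text_roots):
    """Words of a neighborhood (kept in order) that also occur in the text."""
    return [w for w in words if w in text_roots]


sentence = "The aircraft turned, banking slightly."
text_roots = {root(w) for w in sentence.split()}

au = ("make go up move so they high also round car side turn road "
      "aircraft slope bend").split()
senses = {
    "bank(4)": "slope make bend road so they safe car go round".split(),
    "bank(6)": "car aircraft move side high make turn".split(),
}

threshold = 2
chosen = None
au_hits = overlap(au, text_roots)
if len(au_hits) >= threshold:
    # AU qualifies, so pick the AU sense whose definition overlaps most.
    chosen = max(senses, key=lambda s: len(overlap(senses[s], text_roots)))
print(au_hits, chosen)
```

The stripper is good enough for this sentence ("turned" reduces to "turn"), but a real system would need proper morphological analysis.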
    <Paragraph position="19"> Example 2. The sentence is &quot;We got a bank loan to buy a car.&quot; The original neighborhoods of &quot;bank&quot; are, of course, the same as in Example 1. The threshold is again 2. None of the neighborhoods has more than one word in common with the sentence, so the iterative process of enlarging the neighborhoods is used. The AU neighborhood is expanded to include &quot;engine&quot; since it is the first word in the AU neighborhood of &quot;make&quot;. The first word in the AU neighborhood of &quot;up&quot; is &quot;increase&quot;, so &quot;increase&quot; is added to the neighborhood. If the word to be added already appears in the neighborhood of &quot;bank&quot;, no word is added.</Paragraph>
    <Paragraph position="20"> On the fifteenth iteration, the EC neighborhood contains &quot;get&quot; and &quot;buy&quot;. None of the other neighborhoods have more than one word in common with the sentence, so EC is selected as the subject code. Definitions 8, 12, and 13 of bank all have the EC subject code, so their definitions are used as cores to build neighborhoods to allow us to choose one of them. After twenty-three iterations, bank(8) is selected. Experiments are underway to test this method and variations of it on large numbers of sentences so that its effectiveness may be compared with other disambiguation techniques. Results of these experiments will be reported elsewhere.</Paragraph>
  </Section>
  <Section position="7" start_page="149" end_page="151" type="metho">
    <SectionTitle>
FURTHER APPLICATIONS
</SectionTitle>
    <Paragraph position="0"> Several applications of subject-dependent neighborhoods in addition to word-sense disambiguation are also being pursued. For information retrieval, previously constructed neighborhoods relevant to the subject area can be used to expand a query and the target (titles, key words, etc.) to include more words in the intersection, and improve both recall and precision.</Paragraph>
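Query and target expansion can be sketched as follows. This is a minimal sketch under assumptions: the neighborhoods shown are invented for illustration, the number of words added per term (`per_term`) is a hypothetical parameter, and the score is a plain overlap count rather than any retrieval measure named in the text.

```python
def expand_terms(terms, neighborhoods, per_term=3):
    """Append up to `per_term` neighborhood words for each original term."""
    expanded = list(terms)
    for term in terms:
        for extra in neighborhoods.get(term, [])[:per_term]:
            if extra not in expanded:
                expanded.append(extra)
    return expanded


def match_score(query, target, neighborhoods):
    """Number of words shared by the expanded query and expanded target."""
    q = set(expand_terms(query, neighborhoods))
    t = set(expand_terms(target, neighborhoods))
    return len(q.intersection(t))


# HYPOTHETICAL subject-dependent neighborhoods for one subject area.
ec = {
    "bank": ["account", "cheque", "money"],
    "loan": ["lend", "money", "borrow"],
}
query = ["bank", "loan"]
title = ["borrow", "money"]
# First score uses no neighborhoods, second uses the expansion above.
print(match_score(query, title, {}), match_score(query, title, ec))
```

In this toy case the unexpanded query shares no words with the title, while the expanded query shares two; whether precision also improves depends on the neighborhoods and is not something the sketch demonstrates.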
    <Paragraph position="1"> Another application is the determination of the subject area of a text. The effectiveness of searching for key words to determine the topic of a text is limited by the choice of the particular list of key words, and by the fact that the text may use synonyms or refer to the concept the key word represents without using it (for example by using a pronoun in its place). Rather than simply searching for key words indicative of a topic, we could therefore look for word associations, thereby involving more words in the process and making it less vulnerable to the above problems. Neighborhoods of words in the text could be constructed for each of the six fundamental categories, and intersected with the surrounding words in the text. After choosing the category with the greatest intersection, we would then traverse the subject code tree downward to arrive at a more specific code, stopping at any point where there is not enough data to allow us to choose one code over the others at that level. Once a subject code is selected for a text, it could be used as a context for word-sense disambiguation.</Paragraph>
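The downward traversal of the subject code tree can be sketched as a greedy descent. The sketch simplifies the description in several assumed ways: each code is given one pre-built set of neighborhood words, the stopping rule is a hypothetical `margin` between the best and second-best child, and the tree, codes, and word sets are invented for illustration.

```python
def classify(text_words, children, neighborhood_for, top="ALL", margin=1):
    """Descend the subject tree, picking the child with the largest overlap.

    children: dict mapping a code to the list of its sub-codes.
    neighborhood_for: dict mapping a code to a set of neighborhood words
        assumed to have been built already for that code.
    margin: HYPOTHETICAL lead the best child needs over the runner-up;
        the text only says to stop when the data cannot separate codes.
    """
    text = set(text_words)
    code = top
    while children.get(code):
        scored = sorted(
            ((len(text.intersection(neighborhood_for.get(c, set()))), c)
             for c in children[code]),
            reverse=True)
        best_score, best = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0
        if best_score == 0 or margin > best_score - runner_up:
            break
        code = best
    return code


# Toy tree and neighborhoods (hypothetical).
children = {"ALL": ["SCIENCE", "SPORTS"], "SPORTS": ["GOLF", "TENNIS"]}
neighborhood_for = {
    "SCIENCE": {"atom", "cell"},
    "SPORTS": {"ball", "club", "game"},
    "GOLF": {"club", "green", "hole"},
    "TENNIS": {"racket", "net"},
}
text = "he hit the ball with a club onto the green".split()
print(classify(text, children, neighborhood_for))
```

Stopping early simply returns the most specific code reached so far, which matches the idea of halting wherever the data cannot separate the codes at a level.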
  </Section>
</Paper>