<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1032">
  <Title>The Descent of Hierarchy, and Selection in Relational Semantics</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Linguistic Motivation
</SectionTitle>
    <Paragraph position="0"> [Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 247-254.] One way to understand the relations between the words in a two-word noun compound is to cast the words into a head-modifier relationship, and assume that the head noun has an argument structure, much the way verbs do, as well as a qualia structure in the sense of Pustejovsky (1995). Then the meaning of the head noun determines what kinds of things can be done to it, what it is made of, what it is a part of, and so on.</Paragraph>
    <Paragraph position="1"> For example, consider the noun knife. Knives are created for particular activities or settings, can be made of various materials, and can be used for cutting or manipulating various kinds of things. A set of relations for knives, and example NCs exhibiting these relations, are shown below: (Used-in): kitchen knife, hunting knife; (Made-of): steel knife, plastic knife; (Instrument-for): carving knife; (Used-on): meat knife, putty knife; (Used-by): chef's knife, butcher's knife. Some relationships apply to only certain classes of nouns; the semantic structure of the head noun determines the range of possibilities. Thus if we can capture regularities about the behaviors of the constituent nouns, we should also be able to predict which relations will hold between them.</Paragraph>
    <Paragraph position="2"> We propose using the categorization provided by a lexical hierarchy for this purpose. Using a large collection of noun compounds, we assign semantic descriptors from the lexical hierarchy to the constituent nouns and determine the relations between them. This approach avoids the need to enumerate in advance all of the relations that may hold. Rather, the corpus determines which relations occur.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Lexical Hierarchy: MeSH
</SectionTitle>
    <Paragraph position="0"> MeSH (Medical Subject Headings)1 is the National Library of Medicine's controlled vocabulary thesaurus; it consists of a set of terms arranged in a hierarchical structure. There are 15 main sub-hierarchies (trees) in MeSH, each corresponding to a major branch of medical terminology. For example, tree A corresponds to Anatomy, tree B to Organisms, tree C to Diseases and so on. Every branch has several sub-branches; Anatomy, for example, consists of Body Regions (A01), Musculoskeletal System (A02), Digestive System (A03) etc. We refer to these as &quot;level 0&quot; categories.</Paragraph>
    <Paragraph position="1"> These nodes have children; for example, Abdomen (A01.047) and Back (A01.176) are level 1 children of Body Regions. The longer the ID of the MeSH term, the longer the path from the root and the more precise the description. For example, migraine is C10.228.140.546.800.525, that is, C (a disease), C10 (Nervous System Diseases) and so on. [Footnote 1: the work reported in this paper uses MeSH 2001.]</Paragraph>
    <Paragraph position="2"> There are over 35,000 unique IDs in MeSH 2001. Many words are assigned more than one MeSH ID and so occur in more than one location within the hierarchy; thus the structure of MeSH can be interpreted as a network.</Paragraph>
    <Paragraph position="3"> Some of the categories are more homogeneous than others. The tree A (Anatomy) for example, seems to be quite homogeneous; at level 0, the nodes are all part of (meronymic to) Anatomy: the Digestive (A03), Respiratory (A04) and the Urogenital (A05) Systems are all part of anatomy; at level 1, the Biliary Tract (A03.159) and the Esophagus (A03.365) are part of the Digestive System (level 0) and so on. Thus we assume that every node is a (body) part of the parent node (and all the nodes above it).</Paragraph>
    <Paragraph position="4"> Tree C for Diseases is also homogeneous; the child nodes are a kind of (hyponym of) the disease at the parent node: Neoplasms (C04) is a kind of Disease C and Hamartoma (C04.445) is a kind of Neoplasms.</Paragraph>
    <Paragraph position="5"> Other trees are more heterogeneous, in the sense that the meanings among the nodes are more diverse. Information Science (L01), for example, contains, among others, Communications Media (L01.178), Computer Security (L01.209) and Pattern Recognition (L01.725). Another heterogeneous sub-hierarchy is Natural Science (H01). Among the children of H01 we find Chemistry (parent of Biochemistry), Electronics (parent of Amplifiers and Robotics), Mathematics (Fractals, Game Theory and Fourier Analysis). In other words, we find a wide range of concepts that are not described by a simple relationship. These observations suggest that once an algorithm descends to a homogeneous level, words falling into the subhierarchy at that level (and below it) behave similarly with respect to relation assignment.</Paragraph>
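    <Paragraph position="6"> The level and path structure of MeSH IDs described in this section can be illustrated with a minimal Python sketch. It is not part of the original study; the function names are hypothetical, and it assumes only that an ID is a dot-separated string whose first component is the level 0 category.

```python
def mesh_level(tree_number: str) -> int:
    """Level of a MeSH ID in the paper's sense: 'A01' is 0, 'A01.047' is 1."""
    return tree_number.count(".")


def mesh_ancestors(tree_number: str) -> list[str]:
    """IDs on the path from the level 0 category down to the ID itself."""
    parts = tree_number.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def is_under(tree_number: str, category: str) -> bool:
    """True if tree_number equals category or lies below it in the tree."""
    return tree_number == category or tree_number.startswith(category + ".")


# The migraine ID quoted in the text, expanded into its path from level 0.
for node in mesh_ancestors("C10.228.140.546.800.525"):
    print(mesh_level(node), node)
```

Comparing on the dot boundary in is_under keeps an ID such as A011 from being treated as a child of A01.</Paragraph>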
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Counting Noun Compounds
</SectionTitle>
    <Paragraph position="0"> In this and the next section, we describe how we investigated the hypothesis: For all two-word noun compounds (NCs) that can be characterized by a category pair (CP), a particular semantic relationship holds between the nouns comprising those NCs.</Paragraph>
    <Paragraph position="1"> The kinds of relations we found are similar to those described in Section 2. Note that, in this analysis we focused on determining which sets of NCs fall into the same relation, without explicitly assigning names to the relations themselves. Furthermore, the same relation may be described by many different category pairs (see Section 5.5).</Paragraph>
    <Paragraph position="2"> First, we extracted two-word noun compounds from approximately 1M titles and abstracts from the Medline collection of biomedical journal articles, resulting in about 1M NCs. [Figure 1 caption, partially recovered: ... indicates the number of unique NCs that fall under the CP. Only those for which [threshold illegible] NCs occur are shown.]</Paragraph>
    <Paragraph position="3"> The NCs were extracted by finding adjacent word pairs in which both words are tagged as nouns by a tagger and appear in the MeSH hierarchy, and the words preceding and following the pair do not appear in MeSH.2 Of these two-word noun compounds, 79,677 were unique.</Paragraph>
    <Paragraph position="4"> Next we used MeSH to characterize the NCs according to semantic category(ies). For example, the NC fibroblast growth was categorized into A11.329.228 (Fibroblasts) and G07.553.481 (Growth).</Paragraph>
    <Paragraph position="5"> Note that the same words can be represented at different levels of description. For example, fibroblast growth can be described by the MeSH descriptors A11.329.228 G07.553.481 (original level), but also by A11 G07 (Cells and Physiological Processes) or A11.329 G07.553 (Connective Tissue Cells and Growth and Embryonic Development). If a noun fell under more than one MeSH ID, we made multiple versions of this categorization. We refer to the result of this renaming as a category pair (CP). We placed these CPs into a two-dimensional table, with the MeSH category for the first noun on the X axis, and the MeSH category for the second noun on the Y axis. Each intersection indicates the number of NCs that are classified under the corresponding two MeSH categories. A visualization tool (Ahlberg and Shneiderman, 1994) allowed us to explore the dataset to see which areas of the category space are most heavily populated, and to get a feeling for whether the distribution is uniform or not (see Figure 1). If our hypothesis holds (that NCs that fall within the same category pairs are assigned the same relation), then if most of the NCs fall within only a few category pairs, we only need to determine which relations hold between a subset of the possible pairs. Thus, the more clumped the distribution, the easier (potentially) our task is. Figure 1 shows that some areas in the CP space have a higher concentration of unique NCs (the Anatomy, and the E through N sub-hierarchies, for example), especially when we focus on those for which at least 50 unique NCs are found. [Footnote 2: Clearly, this simple approach results in some erroneous extractions.]</Paragraph>
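    <Paragraph position="6"> The construction of category pairs and of the count table can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code used in the study: the function names and the toy lexicon MESH_IDS are hypothetical, and the lexicon contains only the two IDs quoted above.

```python
from collections import Counter
from itertools import product


def truncate(tree_number, level):
    """Cut a MeSH ID to the given level; level 0 keeps only e.g. 'A11'."""
    return ".".join(tree_number.split(".")[: level + 1])


def category_pairs(first_ids, second_ids, level1=0, level2=0):
    """Distinct CPs for one NC, one per combination of MeSH senses."""
    firsts = {truncate(t, level1) for t in first_ids}
    seconds = {truncate(t, level2) for t in second_ids}
    return sorted(product(firsts, seconds))


def count_cps(noun_compounds, mesh_ids, level1=0, level2=0):
    """Number of unique NCs under each CP (the cells of the 2-D table)."""
    table = Counter()
    for n1, n2 in set(noun_compounds):  # set() keeps each NC once
        for cp in category_pairs(mesh_ids[n1], mesh_ids[n2], level1, level2):
            table[cp] += 1
    return table


# Toy lexicon: only the IDs quoted in the text for "fibroblast growth".
MESH_IDS = {"fibroblast": ["A11.329.228"], "growth": ["G07.553.481"]}
print(category_pairs(MESH_IDS["fibroblast"], MESH_IDS["growth"]))
print(category_pairs(MESH_IDS["fibroblast"], MESH_IDS["growth"], 1, 1))
```

A noun with several MeSH IDs contributes one CP per combination of senses, which mirrors the "multiple versions" of the categorization mentioned above.</Paragraph>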
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Labeling NC Relations
</SectionTitle>
    <Paragraph position="0"> Given the promising nature of the NC distributions, the question remains as to whether or not the hypothesis holds. To answer this, we examined a subset of the CPs to see if we could find positions within the sub-hierarchies for which the relation assignments for the member NCs are always the same.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Method
</SectionTitle>
      <Paragraph position="0"> We first selected a subset of the CPs to examine in detail.</Paragraph>
      <Paragraph position="1"> For each of these we examined, by hand, 20% of the NCs they cover, paraphrasing the relation between the nouns, and seeing if that paraphrase was the same for all the NCs in the group. If it was the same, then the current levels of the CP were considered to be the correct levels of description. If, on the other hand, several different paraphrases were found, then the analysis descended one level of the hierarchy. This was repeated until the resulting partition of the NCs yielded uniform relation assignments.</Paragraph>
      <Paragraph position="2"> For example, all the following NCs were mapped to the same CP, A01 (Body Regions) and A07 (Cardiovascular System): scalp arteries, heel capillary, shoulder artery, ankle artery, leg veins, limb vein, forearm arteries, finger capillary, eyelid capillary, forearm microcirculation, hand vein, forearm veins, limb arteries, thigh vein, foot vein. All these NCs are &quot;similar&quot; in the sense that the relationships between the two words are the same; therefore, we do not need to descend either hierarchy. We call the pair (A01, A07) a &quot;rule&quot;, where a rule is a CP for which all the NCs under it have the same relationship. In the future, when we see an NC mapped to this rule, we will assign this semantic relationship to it.</Paragraph>
      <Paragraph position="3"> On the other hand, the following NCs, having the CP A01 (Body Regions) and M01 (Persons), do not have the same relationship between the component words: abdomen patients, arm amputees, chest physicians, eye patients, skin donor. The relationships are different depending on whether the person is a patient, a physician or a donor. We therefore descend the M01 sub-hierarchy, obtaining the following clusters of NCs:</Paragraph>
      <Paragraph position="5"> [...] eye nurse, eye physician; A01, M01.898 (Donors): eye donor, skin donor; A01, M01.150 (Disabled Persons): arm amputees, knee amputees.</Paragraph>
      <Paragraph position="6"> In other words, to correctly assign a relationship to these NCs, we needed to descend one level for the second word. The resulting rules in this case are (A01, M01.643), (A01, M01.150) etc. Figure 2 shows one CP for which we needed to descend 3 levels.</Paragraph>
      <Paragraph position="7"> In our collection, a total of 2627 CPs at level 0 have at least 10 unique NCs. Of these, 798 (30%) are classified with A (Anatomy) for either the first or the second noun. We randomly selected 250 of these CPs for analysis. We also analyzed 21 of the 90 CPs for which the second noun was H01 (Natural Sciences); we decided to analyze this portion of the MeSH hierarchy because the NCs with H01 as second noun are frequent in our collection, and because we wanted to test the hypothesis that we do indeed need to descend farther for heterogeneous parts of MeSH.</Paragraph>
      <Paragraph position="8"> Finally, we analyzed three CPs in category C (Diseases); the most frequent CP in terms of the total number of non-unique NCs is C04 (Neoplasms) A11 (Cells), with 30606 NCs; the second CP was A10 C04 (27520 total NCs) and the fifth most frequent, A01 C04, with 20617 total NCs; we analyzed these CPs.</Paragraph>
      <Paragraph position="9"> We started with the CPs at level 0 for both words, descending when the corresponding clusters of NCs were not homogeneous and stopping when they were. We did this for 20% of the NCs in each CP. The results were as follows.</Paragraph>
      <Paragraph position="10"> For 187 of 250 (74%) CPs with a noun in the Anatomy category, the classification remained at level 0 for both words (for example, A01 A07). For 55 (22%) of the CPs we had to descend 1 level (e.g., A01 M01: A01 M01.898, A01 M01.643) and for 7 CPs (2%) we descended two levels. We descended one level most of the time for the sub-hierarchies E (Analytical, Diagnostic and Therapeutic Techniques), G (Biological Sciences) and N (Health Care) (around 50% of the time for these categories combined). We never descended for B (Organisms) and did so only for A13 (Animal Structures) in A. This was to be able to distinguish a few non-homogeneous subcategories (e.g., milk appearing among body parts, thus forcing a distinction between buffalo milk and cat forelimb).</Paragraph>
      <Paragraph position="11"> For CPs with H01 as the second noun, of the 21 CPs analyzed, we observed the following (level number, count) pairs: (0, 1) (1, 8) (2, 12).</Paragraph>
      <Paragraph position="12"> In all but three cases, the descending was done for the second noun only. This may be because the second noun usually plays the role of the head noun in two-word noun compounds in English, thus requiring more specificity.</Paragraph>
      <Paragraph position="13"> Alternatively, it may reflect the fact that for the examples we have examined so far, the more heterogeneous terms dominate the second noun. Further examination is needed to answer this decisively.</Paragraph>
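      <Paragraph position="14"> The descent procedure of this section can be summarized in a short Python sketch. It is a simplification under stated assumptions rather than the procedure as actually carried out: the relation judgments were made by hand, so they appear here as precomputed labels; only the second noun's category is descended; and the function names, the max_level cutoff and the relation labels in SAMPLE are hypothetical placeholders.

```python
from collections import defaultdict


def truncate(tree_number, level):
    """Cut a MeSH ID to the given level; level 0 keeps only e.g. 'M01'."""
    return ".".join(tree_number.split(".")[: level + 1])


def find_rules(labeled, level=0, max_level=5):
    """Derive rules from hand-labeled NCs that share one level 0 CP.

    labeled: list of (first_id, second_id, relation) triples, where
    relation is the annotator's paraphrase (any hashable label).
    Assumption: only the second noun's category is descended.
    Returns {(first_category, second_category): relation}; the value is
    None if the NCs are still mixed at max_level.
    """
    groups = defaultdict(list)
    for first_id, second_id, relation in labeled:
        cp = (truncate(first_id, 0), truncate(second_id, level))
        groups[cp].append((first_id, second_id, relation))

    rules = {}
    for cp, members in groups.items():
        relations = {relation for _, _, relation in members}
        if len(relations) == 1:
            rules[cp] = relations.pop()  # uniform: this CP is a rule
        elif level >= max_level:
            rules[cp] = None  # still heterogeneous at the cutoff
        else:
            rules.update(find_rules(members, level + 1, max_level))
    return rules


# Placeholder labels; the M01 IDs are the ones quoted in the text.
SAMPLE = [
    ("A01", "M01.643", "patient-relation"),   # e.g. eye patients
    ("A01", "M01.643", "patient-relation"),   # e.g. abdomen patients
    ("A01", "M01.898", "donor-relation"),     # e.g. skin donor
    ("A01", "M01.150", "disabled-relation"),  # e.g. arm amputees
]
print(find_rules(SAMPLE))
```

On this sample the level 0 pair (A01, M01) is mixed, so the sketch descends once and returns one rule per level 1 category of the second noun.</Paragraph>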
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Accuracy
</SectionTitle>
      <Paragraph position="0"> We tested the resulting classifications by developing a randomly chosen test set (20% of the NCs for each CP), entirely distinct from the labeled set, and used the classifications (rules) found above to automatically predict which relations should be assigned to the member NCs. An independent evaluator with biomedical training checked these results manually, and found high accuracies: For the CPs which contained a noun in the Anatomy domain, the assignments of new NCs were 94.2% accurate computed via intra-category averaging, and 91.3% accurate with extra-category averaging. For the CPs in the Natural Sciences (H01) we found 81.6% accuracy via intra-category averaging, and 78.6% accuracy with extra-category averaging. For the three CPs in the C04 category we obtained 100% accuracy.</Paragraph>
      <Paragraph position="1"> The total accuracy across the portions of the A, H01 and C04 hierarchies that we analyzed was 89.6% via intra-category averaging, and 90.8% via extra-category averaging.</Paragraph>
      <Paragraph position="2"> The lower accuracy for the Natural Sciences category illustrates the dependence of the results on the properties of the lexical hierarchy. We can generalize well if the sub-hierarchies are in a well-defined semantic relation with their ancestors. If they are a list of &quot;unrelated&quot; topics, we cannot use the generalization of the higher levels; most of the mistakes for the Natural Sciences CPs occurred in fact when we failed to descend for broad terms such as Physics. Performing this evaluation allowed us to find such problems and update the rules; the resulting categorization should now be more accurate.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Generalization
</SectionTitle>
      <Paragraph position="0"> An important issue is whether this method is an economical way of classifying the NCs. The advantage of the high level description is, of course, that we need to assign by hand many fewer relationships than if we used all CPs at their most specific levels. Our approach provides generalization over the &quot;training&quot; examples in two ways. First, we find that we can use the juxtaposition of categories in a lexical hierarchy to identify semantic relationships.</Paragraph>
      <Paragraph position="1"> Second, we find we can use the higher levels of these categories for the assignments of these relationships.</Paragraph>
      <Paragraph position="2"> To assess the degree of this generalization we calculated how many CPs are accounted for by the classification rules created above for the Anatomy categories. In other words, if we know that A01 A07 unequivocally determines a relationship, how many possible (i.e., present in our collection) CPs are there that are &quot;covered by&quot; A01 A07 and that we do not need to consider explicitly? It turns out that our 415 classification rules cover 46001 possible CP pairs.3</Paragraph>
      <Paragraph position="3"> This, and the fact that we achieve high accuracies with these classification rules, show that we successfully use MeSH to generalize over unique NCs.</Paragraph>
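      <Paragraph position="4"> The coverage count described above can be sketched in Python as follows. This is a hedged illustration, not the original computation: it assumes that a CP is covered by a rule when each of its two categories equals, or lies below, the corresponding category of the rule, and the function names and the small example sets are hypothetical.

```python
def is_under(tree_number, category):
    """True if tree_number equals category or lies below it in the tree."""
    return tree_number == category or tree_number.startswith(category + ".")


def covered_cps(rules, observed_cps):
    """Observed CPs that fall under at least one rule."""
    return {
        (c1, c2)
        for c1, c2 in observed_cps
        if any(is_under(c1, r1) and is_under(c2, r2) for r1, r2 in rules)
    }


# Illustrative data built only from IDs quoted in the text.
RULES = [("A01", "A07")]
OBSERVED = {("A01.047", "A07"), ("A01.176", "A07"), ("A01.047", "M01.643")}
print(len(covered_cps(RULES, OBSERVED)))
```

Applied to the full set of rules and to all CPs present in a collection, the size of the returned set corresponds to the kind of coverage figure reported above.</Paragraph>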
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Ambiguity
</SectionTitle>
      <Paragraph position="0"> A common problem for NLP tasks is ambiguity. In this work we observe two kinds: lexical and &quot;relationship&quot; ambiguity. As an example of the former, mortality can refer to the state of being mortal or to death rate. As an example of the latter, bacteria mortality can either mean &quot;death of bacteria&quot; or &quot;death caused by bacteria&quot;. In some cases, the relationship assignment method described here can help disambiguate the meaning of an ambiguous lexical item. Milk for example, can be both Animal Structures (A13) and Food and Beverages (J02).</Paragraph>
      <Paragraph position="1"> Consider the NCs chocolate milk, coconut milk that fall under the CPs (B06 -Plants-, J02) and (B06, A13). The CP (B06, J02) contains 180 NCs (other examples are berry wines, cocoa beverages) while (B06, A13) has only 6 NCs (4 of which with milk). Assuming then that (B06, A13) is &quot;wrong&quot;, we will assign only (B06, J02) to chocolate milk, coconut milk, therefore disambiguating the sense for milk in this context (Beverage). Analogously, for buffalo milk, caprine milk we also have two CPs (B02, J02) (B02, A13). In this case, however, it is easy to show that only (B02 -Vertebrates-, A13) is the correct one (i.e. yielding the correct relationship) and we then assign the MeSH sense A13 to milk.</Paragraph>
      <Paragraph position="2"> [Footnote 3: Although we began with 250 CPs in the A category, when a descend operation is performed, the CP is split into two or more CPs at the level below. Thus the total number of CPs after all assignments are made was 415.] Nevertheless, ambiguity may be a problem for this method. We see five different cases:</Paragraph>
      <Paragraph position="3"> 1) Single MeSH senses for the nouns in the NC (no lexical ambiguity) and only one possible relationship which can be predicted by the CP; that is, no ambiguity. For instance, in abdomen radiography, abdomen is classified exclusively under Body Regions and radiography exclusively under Diagnosis, and the relationship between them is unambiguous. Other examples include aciclovir treatment (Heterocyclic Compounds, Therapeutics) and adenocarcinoma treatment (Neoplasms, Therapeutics).</Paragraph>
      <Paragraph position="4"> 2) Single MeSH senses (no lexical ambiguity) but multiple readings for the relationships that therefore cannot be predicted by the CP. It was quite difficult to find examples of this case; disambiguating this kind of NC requires looking at the context of use. The examples we did find include hospital databases which can be databases regarding (topic) hospitals, databases found in (location) or owned by hospitals. Education efforts can be efforts done through (education) or done to achieve education.</Paragraph>
      <Paragraph position="5"> Kidney metabolism can be metabolism happening in (location) or done by the kidney. Immunoglobulin staining (D12 -Amino Acids, Peptides, and Proteins-, E05 -Investigative Techniques-) can mean either staining with immunoglobulin or staining of immunoglobulin.</Paragraph>
      <Paragraph position="6"> 3) Multiple MeSH mappings but only one possible relation. One example of this case is alcoholism treatment where treatment is Therapeutics (E02) and alcoholism is both Disorders of Environmental Origin (C21) and Mental Disorders (F03). For this NC we have therefore 2 CPs: (C21, E02) as in wound treatments, injury rehabilitation and (F03, E02) as in delirium treatment, schizophrenia therapeutics. The multiple mappings reflect the conflicting views on how to classify the condition of alcoholism, but the relationship does not change.</Paragraph>
      <Paragraph position="7"> 4) Multiple MeSH mappings and multiple relations that can be predicted by the different CPs. For example, bread diet can mean either that a person usually eats bread or that a physician prescribed bread to treat a condition. This difference is reflected by the different mappings: diet is both Investigative Techniques (E05) and Metabolism and Nutrition (G06), bread is Food and Beverages (J02). In these cases, the category can help disambiguate the relation (as opposed to in case 5 below); word sense disambiguation algorithms that use context may be helpful.</Paragraph>
      <Paragraph position="8"> 5) Multiple MeSH mappings and multiple relations that cannot be predicted by the different CPs. As an example of this case, bacteria mortality can be either &quot;death of bacteria&quot; or &quot;death caused by bacteria&quot;. The multiple mapping for mortality (Public Health, Information Science, Population Characteristics and Investigative Techniques) does not account for this ambiguity. Similarly, for inhibin immunization, the first noun falls under Hormones and Amino Acids, while immunization falls under Environment and Public Health and Investigative Techniques. The meanings are immunization against inhibin or immunization using inhibin, and they cannot be disambiguated using only the MeSH descriptors.</Paragraph>
      <Paragraph position="9"> We currently do not have a way to determine how many instances of each case occur. Cases 2 and 5 are the most problematic; however, as it was quite difficult to find examples for these cases, we suspect they are relatively rare. A question arises as to whether representing nouns using the topmost levels of the hierarchy causes a loss in information about lexical ambiguity. In effect, when we represent the terms at higher levels, we assume that words that have multiple descriptors under the same level are very similar, and that retaining the distinction would not be useful for most computational tasks. For example, osteosarcoma occurs twice in MeSH, as C04.557.450.565.575.650 and C04.557.450.795.620. When described at level 0, both descriptors reduce to C04, at level 1 to C04.557, removing the ambiguity. By contrast, microscopy also occurs twice, but under E05.595 and H01.671.606.624. Reducing these descriptors to level 0 retains the two distinct senses.</Paragraph>
      <Paragraph position="10"> To determine how often different senses are grouped together, we calculated the number of MeSH senses for words at different levels of the hierarchy. Table 1 shows a histogram of the number of senses for the first noun of all the unique NCs in our collection, the average degree of ambiguity and the average description lengths.4 The average number of MeSH senses is always less than two, and increases with length of description, as is to be expected. We observe that 3.6% of the lexical ambiguity is at levels higher than 2, 16% at L2, 21.4% at L1 and 59% at L0. Levels 0 and 1 combined account for more than 80% of the lexical ambiguity. This means that when a noun has multiple senses, those senses are more likely to come from different main subtrees of MeSH (A and B, for example), than from different deeper nodes in the same subtree (H01.671.538 vs. H01.671.252). This fits nicely with our method of describing the NCs with the higher levels of the hierarchy: if most of the ambiguity is at the highest levels (as these results show), information about lexical ambiguity is not lost when we describe the NCs using the higher levels of MeSH. Ideally, however, we would like to reduce the lexical ambiguity for similar senses and to retain it when the senses are semantically distinct (like, for example, for diet in case 4). In other words, ideally, the ambiguity left at the levels of our rules accounts for only (and for all) the semantically different senses. Further analysis is needed, but the high accuracy we obtained in the classification seems to indicate that this indeed is what is happening.</Paragraph>
      <Paragraph position="11"> [Table 1 caption, partially recovered: ... to different levels of MeSH. Original refers to the actual (nontruncated) MeSH descriptor. Avg # Senses is the average number of senses computed for all first nouns in the collection. Avg Desc Len is the average description length; the value for level 1 is less than 2 and for level 2 is less than 3, because some nouns are always mapped to higher levels (for example, cell is always mapped to A11).]</Paragraph>
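      <Paragraph position="12"> The effect of truncation on lexical ambiguity can be checked with a small Python sketch. It is an illustration only, with hypothetical function names; it uses the osteosarcoma and microscopy IDs quoted above and counts a noun's senses at a level as the number of distinct IDs left after truncation to that level.

```python
def senses_at_level(tree_numbers, level):
    """Number of distinct MeSH senses left after truncating to a level."""
    return len({".".join(t.split(".")[: level + 1]) for t in tree_numbers})


def ambiguity_level(tree_numbers):
    """Shallowest level at which the IDs diverge, or None if they never do."""
    if not tree_numbers:
        return None
    depth = max(t.count(".") for t in tree_numbers)
    for level in range(depth + 1):
        if senses_at_level(tree_numbers, level) > 1:
            return level
    return None


OSTEOSARCOMA = ["C04.557.450.565.575.650", "C04.557.450.795.620"]
MICROSCOPY = ["E05.595", "H01.671.606.624"]

print(senses_at_level(OSTEOSARCOMA, 0), senses_at_level(OSTEOSARCOMA, 1))
print(senses_at_level(MICROSCOPY, 0))
print(ambiguity_level(OSTEOSARCOMA), ambiguity_level(MICROSCOPY))
```

Tallying ambiguity_level over all ambiguous nouns in a collection would give a per-level breakdown of the kind discussed above, under the assumption that ambiguity is attributed to the shallowest level at which the senses diverge.</Paragraph>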
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Multiple Occurrences of Semantic Relations
</SectionTitle>
      <Paragraph position="0"> Because we determine the possible relations in a data-driven manner, the question arises of how often the same semantic relation occurs for different category pairs.</Paragraph>
      <Paragraph position="1"> To determine the answer, we could (i) look at all the CPs, give a name to the relations and &quot;merge&quot; the CPs that have the same relationships; or (ii) draw a sample of NC examples for a given relation, look at the CPs for those examples and verify that all the NCs for those CPs are indeed in the same relationship.</Paragraph>
      <Paragraph position="2"> We may not be able to determine the total number of relations, or how often they repeat across different CPs, until we examine the full spectrum of CPs. However, we did a preliminary analysis to attempt to find relation repetition across category pairs. As one example, we hypothesized a relation afflicted by and verified that it applies to all the CPs of the form (Disease C, Patients M01.643), e.g.: anorexia (C23) patients, cancer (C04) survivor, influenza (C02) patients. This relation also applies to some of the F category (Psychiatry), as in delirium (F03) patients, anxiety (F01) patient.</Paragraph>
      <Paragraph position="3"> It becomes a judgement call whether to also include NCs such as eye (A01) patient, gallbladder (A03) patients, and more generally, all the (Anatomy, Patients) pairs. The question is, is &quot;afflicted-by (unspecified) Disease in Anatomy Part&quot; equivalent to &quot;afflicted by Disease?&quot; The answer depends on one's theory of relational semantics. Another quandary is illustrated by the NCs adolescent cancer, child tumors, adult dementia (in which adolescent, child and adult are Age Groups) and the heads are Diseases. Should these fall under the afflicted by relation, given the references to entire groups?</Paragraph>
    </Section>
  </Section>
</Paper>