<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3106">
  <Title>Clustering MeSH Representations of Biomedical Literature</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Document Representations
</SectionTitle>
    <Paragraph position="0"> Representing documents for clustering and other text mining tasks is a fundamental step in the knowledge discovery process. The ability to derive useful information from a document collection may be entirely determined by the attributes used to describe the documents. A commonly used representation in text mining and information retrieval is the vector representation. We summarize vector representations below and refer the reader to a text on information retrieval (Korfhage, 1997) for a more detailed description.</Paragraph>
    <Paragraph position="1"> Suppose D is a collection of documents and T = {t1,t2,...,tn} is the collection of unique terms appearing in at least one document in D. T is typically obtained by extracting individual words (e.g., characters between spaces) from the text (e.g., titles, abstracts, and body) of each paper, although more sophisticated parsing may be used. Individual words may be further processed by stop word removal (the removal of words without inherent meaning, such as articles or pronouns) and stemming (the removal of suffixes to extract only root words). This term processing often yields better classification and information retrieval results.</Paragraph>
    <Paragraph position="2"> Given T, a document d ∈ D is represented as a vector vd = &lt;w1,w2,...,wn&gt; (1), where wi is called the weight of term ti within document d. Weights are defined based on specific application needs.</Paragraph>
    <Paragraph position="3"> Two examples of commonly used weighting schemes are term frequency (TF) and term frequency inverse document frequency (TFIDF). Let |ti| be the number of times ti appears in a document d, |D| be the number of documents in the document collection, and ni be the number of documents in D containing ti. The TF scheme is defined by wi = |ti|. The TFIDF scheme is defined by wi = |ti| · log2(|D|/ni), so that terms occurring in many documents receive lower weights.</Paragraph>
    <Paragraph position="4"> Consider a document collection D with term collection T = {cancer, diagnosis, medical, viral}. Suppose a document d contains three occurrences of the term cancer, one occurrence of the term diagnosis, four occurrences of the term medical, and no occurrences of the term viral. Then the representation of d using TF weighting is vd = &lt;3,1,4,0&gt;.</Paragraph>
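    <Paragraph> The two weighting schemes can be illustrated with a minimal Python sketch. The four-document collection D below is invented for illustration, and the sketch assumes the standard product form of TFIDF (term frequency times log2(|D|/ni)); the function names are hypothetical.

```python
import math

TERMS = ["cancer", "diagnosis", "medical", "viral"]


def tf_vector(tokens, terms=TERMS):
    """Return the TF vector: the raw count of each term in one document."""
    return [tokens.count(t) for t in terms]


def tfidf_vector(tokens, collection, terms=TERMS):
    """Return TF * log2(|D| / n_i); terms absent from the collection get 0."""
    n_docs = len(collection)
    weights = []
    for t in terms:
        # n_i: number of documents in the collection that contain term t.
        n_i = sum(1 for doc in collection if t in doc)
        idf = math.log2(n_docs / n_i) if n_i else 0.0
        weights.append(tokens.count(t) * idf)
    return weights


# Document d from the worked example: 3x cancer, 1x diagnosis, 4x medical.
d = ["cancer"] * 3 + ["diagnosis"] + ["medical"] * 4
# Hypothetical collection D: d plus three invented documents.
D = [d, ["cancer", "viral"], ["medical"], ["medical", "cancer"]]

print(tf_vector(d))  # [3, 1, 4, 0]
print(tfidf_vector(d, D))
```

    In this toy collection, diagnosis appears in only one of four documents, so its single occurrence is weighted by log2(4) = 2, while the more widespread terms cancer and medical are scaled down.</Paragraph>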
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 MeSH Representations
</SectionTitle>
      <Paragraph position="0"> This investigation builds on the vector space representation of documents described above. Instead of obtaining a term collection T from the full text of titles, abstracts, or content of a paper, T is built from the MeSH assignments for each document. A summary of MeSH is given below.</Paragraph>
      <Paragraph position="1"> The National Library of Medicine (NLM) indexes medical literature with MeSH terms for the purpose of subject indexing and searching of journal articles in PubMed (an online literature database that contains citations from more than 4,600 biomedical journals). MeSH terms are assigned to medical literature by human indexers. MeSH consists of two ontologies: descriptors, or headings, are a collection of terms for primary themes or topics contained in the literature; qualifiers, or subheadings, are terms combined with descriptors to indicate the specific aspect of a descriptor. Formally, a MeSH term is a tuple (d,q) where d is a descriptor and q is a qualifier (q may be empty if d is unqualified). There exist 21975 descriptors and 83 qualifiers in the 2003 MeSH ontology, which was used in this study.</Paragraph>
      <Paragraph position="2"> Both descriptors and qualifiers are organized in directed acyclic graphs (DAGs), where the parent of a descriptor or qualifier is considered more general than the term itself. A descriptor (or qualifier) may have multiple parents, representing that the descriptor (or qualifier) includes multiple concepts in the MeSH ontology simultaneously. For example, in the 2003 MeSH ontology, descriptors have an average of approximately 1.8 parents.</Paragraph>
      <Paragraph position="3">  Portions of the descriptor and qualifier ontologies are displayed in Figures 1 and 2.</Paragraph>
      <Paragraph position="4"> In MeSH representations, weights are derived from the structure of MeSH. Documents are represented as vectors where the term collection T consists of descriptors only, qualifiers only, or combined descriptors and qualifiers (hereafter referred to as the combined representation). Weights are defined by</Paragraph>
      <Paragraph position="5"> wi = 2 if ti is assigned to d; wi = 1 if ti is inferred for d; wi = 0 otherwise.</Paragraph>
      <Paragraph position="6"> A term is inferred if one of its descendants in the MeSH hierarchy is assigned, but the term itself is not assigned.</Paragraph>
      <Paragraph position="7"> Consider d with the term Viremia assigned. The descriptors-only representation is vd = &lt;0,1,1,1,0,1,1,1,2,0&gt;, where the columns correspond to Animal Diseases, Bacterial Infections and Mycoses, Diseases, Infection, Parasitic Diseases, Sepsis, Septicemia, Viral Diseases, Viremia, and Zoonoses, respectively. The relationship between the MeSH hierarchy and the values assigned is demonstrated in Figure 1. In essence, the DAG structure is flattened, but allowable vectors for document representation are restricted to the structure imposed by MeSH.</Paragraph>
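      <Paragraph> The Viremia example can be reproduced with a short Python sketch. The parent links below are a toy hierarchy assumed for illustration (the actual links are those of the 2003 MeSH shown in Figure 1), and the weights of 2 for assigned terms and 1 for inferred terms are the values implied by the example vector; the function names are hypothetical.

```python
# Toy parent links, assumed for illustration only.
PARENTS = {
    "Diseases": [],
    "Animal Diseases": ["Diseases"],
    "Bacterial Infections and Mycoses": ["Diseases"],
    "Parasitic Diseases": ["Diseases"],
    "Viral Diseases": ["Diseases"],
    "Infection": ["Bacterial Infections and Mycoses"],
    "Zoonoses": ["Animal Diseases"],
    "Sepsis": ["Infection"],
    "Septicemia": ["Sepsis"],
    "Viremia": ["Septicemia", "Viral Diseases"],  # two parents: a DAG
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def mesh_vector(assigned, parents=PARENTS):
    """Weight 2 for assigned terms, 1 for inferred ancestors, 0 otherwise."""
    inferred = set()
    for term in assigned:
        inferred |= ancestors(term, parents)
    inferred -= set(assigned)  # an assigned term is never also inferred
    columns = sorted(parents)  # alphabetical, matching the example's columns
    return [2 if t in assigned else 1 if t in inferred else 0 for t in columns]


print(mesh_vector({"Viremia"}))  # [0, 1, 1, 1, 0, 1, 1, 1, 2, 0]
```

      Because inferred terms are found by walking parent links, only vectors consistent with the hierarchy can be produced, which is the restriction described above.</Paragraph>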
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Document Clustering
</SectionTitle>
    <Paragraph position="0"> Many clustering algorithms have been proposed for document clustering. In this study, AGNES (Kaufman and Rousseeuw, 1990), an agglomerative hierarchical clustering algorithm, was employed with average linking. Using this algorithm has two advantages for this study. First, it produces dendrograms, a visualization of the substructures contained in a document collection. Second, AGNES computes an agglomerative coefficient a. Let md be the height at which d is first merged, and let M be the height of the final merge; then</Paragraph>
    <Paragraph position="1"> $a = \frac{1}{|D|} \sum_{d \in D} \left(1 - \frac{m_d}{M}\right)$</Paragraph>
    <Paragraph position="2"> Intuitively, the agglomerative coefficient measures the average similarity of d to the members of the first cluster containing d, normalized to a [0,1] range. For document collections of approximately equal size, a larger value of a indicates better clustering quality (Kaufman and Rousseeuw, 1990).</Paragraph>
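    <Paragraph> As a minimal sketch, the agglomerative coefficient can be computed from the first-merge heights, assuming the Kaufman and Rousseeuw definition as the mean of 1 - md/M over all documents. The function name and the heights below are hypothetical.

```python
def agglomerative_coefficient(first_merge_heights, final_height):
    """Return the mean of 1 - m_d / M over all documents.

    first_merge_heights holds m_d, the height at which each document is
    first merged; final_height is M, the height of the final merge.
    """
    if not first_merge_heights:
        raise ValueError("at least one document is required")
    if not final_height > 0:
        raise ValueError("the final merge height must be positive")
    total = sum(1 - m / final_height for m in first_merge_heights)
    return total / len(first_merge_heights)


# Hypothetical first-merge heights for five documents, final merge at 1.0.
heights = [0.1, 0.1, 0.2, 0.2, 0.5]
print(agglomerative_coefficient(heights, final_height=1.0))
```

    Documents that join a cluster at a low height relative to M contribute values near 1, so tight clusters push the coefficient upward.</Paragraph>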
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Dimension Reduction
</SectionTitle>
      <Paragraph position="0"> The number of unique terms in a document collection is typically large (&gt; 1000), resulting in very high dimensional data. Dimension reduction is therefore commonly employed in text mining before further analysis.</Paragraph>
      <Paragraph position="1"> Principal components analysis (PCA) and related approaches are methods for dimension reduction (Jolliffe, 1986). A full discussion of PCA is beyond the scope of this paper. Several guidelines exist for choosing the number of principal components to retain. In this study, principal components are selected in descending order of variance until 25% of the variation in the data is captured.</Paragraph>
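      <Paragraph> The selection rule can be sketched in Python, assuming the per-component variances are already available (for example, from a PCA routine) and that "captured" means the cumulative share reaches at least the target. The function name and the variances are hypothetical.

```python
def components_for_variance(variances, target=0.25):
    """Return the smallest k whose top-k components reach the target share."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if not total > 0:
        raise ValueError("total variance must be positive")
    captured = 0.0
    for k, var in enumerate(ordered, start=1):
        captured += var
        if captured / total >= target:
            return k
    return len(ordered)


# Hypothetical component variances summing to 40.
variances = [6.0, 3.0, 2.0] + [1.0] * 29
print(components_for_variance(variances))  # 3
```

      Here the first two components capture 22.5% of the variance and the third brings the total to 27.5%, so three components are retained.</Paragraph>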
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Document Similarity
</SectionTitle>
      <Paragraph position="0"> Many clustering algorithms require that a measure of similarity between two documents be defined. Euclidean distance is one measure used in clustering applications. Another measure, used in information retrieval, is the cosine measure (Korfhage, 1997), which measures similarity by calculating the cosine of the angle between the vector representations of two documents. Cosine distance is used in this paper.</Paragraph>
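      <Paragraph> A minimal Python sketch of the cosine measure follows. Converting similarity to a distance as 1 minus the cosine is one common convention and is an assumption here, not a detail confirmed by the text; the example vectors are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


vd = [3, 1, 4, 0]  # the TF vector from the earlier example
ve = [6, 2, 8, 0]  # same direction, twice the length
vf = [0, 0, 0, 5]  # shares no terms with vd
print(cosine_distance(vd, ve))  # approximately 0
print(cosine_distance(vd, vf))  # 1.0
```

      Unlike Euclidean distance, the cosine measure ignores vector length, so a document and a longer document with the same term proportions are treated as nearly identical.</Paragraph>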
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Cluster Identification and Summarization
</SectionTitle>
      <Paragraph position="0"> For MeSH representations, clusters are identified and summarized to find interesting groups in the document collection. Individual clusters are identified by cutting the dendrogram at different heights. The clusters are then summarized by computing the cluster center, a vector consisting of the mean term weights across constituent documents, using the full dimensional representation.</Paragraph>
      <Paragraph position="1"> Terms are ranked in descending order according to the resulting mean weight.</Paragraph>
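      <Paragraph> The summarization step can be sketched as follows. The cluster below is invented, and the optional minimum-weight cutoff is an assumption added for illustration; the function name is hypothetical.

```python
def summarize_cluster(vectors, terms, min_weight=0.0):
    """Rank terms by their mean weight across the documents of one cluster."""
    n = len(vectors)
    # Cluster center: the mean of each column (term) over all documents.
    center = [sum(column) / n for column in zip(*vectors)]
    ranked = sorted(zip(terms, center), key=lambda pair: pair[1], reverse=True)
    return [(term, weight) for term, weight in ranked if weight > min_weight]


terms = ["Diseases", "Sepsis", "Viremia", "Zoonoses"]
# Hypothetical cluster of three documents in the full-dimensional space.
cluster = [
    [1, 1, 2, 0],
    [1, 0, 0, 1],
    [1, 1, 2, 0],
]
for term, weight in summarize_cluster(cluster, terms, min_weight=0.5):
    print(term, round(weight, 2))
```

      With the cutoff at 0.5, Viremia, Diseases, and Sepsis are reported in that order, and Zoonoses (mean weight 1/3) is dropped.</Paragraph>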
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> Two document collections were analyzed using document clustering: documents in RGD (Twigger et al., 2002), and documents retrieved by the PubMed query &quot;Tourette's Syndrome.&quot; Each data set is described in more detail below.</Paragraph>
    <Paragraph position="1"> The following procedure was employed for each collection. 1. Documents were encoded in a vector representation. The term collection T was derived from terms in abstracts and titles, MeSH descriptors, MeSH qualifiers, or a combination of MeSH descriptors and qualifiers.</Paragraph>
    <Paragraph position="2"> For the full-text representation, terms from abstracts and titles were obtained using rainbow with stop word removal and stemming options (McCallum, 1996). TF weighting was used.</Paragraph>
    <Paragraph position="3"> For the MeSH descriptors and qualifiers, the assignments were obtained from PubMed XML entries, and inferred terms were determined from the 2003 MeSH hierarchy.</Paragraph>
    <Paragraph position="4">  2. PCA was performed on the represented documents, and principal components capturing 25% of the data variance were selected. The documents were projected onto the selected components.</Paragraph>
    <Paragraph position="5"> 3. The reduced dimension representation was clustered  by AGNES using average linking. The cosine distance measure was used for document similarity. 4. Clusters were identified and summarized.</Paragraph>
    <Paragraph position="6"> Computations were performed using R version 1.7.1 (R Development Core Team, 2003). Clustering was accomplished using the agnes function in the cluster package. PCA calculations used the prcomp function in the mva package.</Paragraph>
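    <Paragraph> The study itself relied on the agnes function in R. Purely as an illustration of what average linking does, the following naive Python sketch merges clusters by mean pairwise distance and then cuts the resulting tree at a chosen height. It is not the AGNES implementation; the function names and the distance matrix are hypothetical.

```python
def average_linkage(dist):
    """Naive average-linkage agglomerative clustering.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for x, ia in enumerate(ids):
            for ib in ids[x + 1:]:
                a, b = clusters[ia], clusters[ib]
                # Mean of all pairwise distances between the two clusters.
                height = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or best[0] > height:
                    best = (height, ia, ib)
        height, ia, ib = best
        a, b = clusters.pop(ia), clusters.pop(ib)
        merges.append((a, b, height))
        clusters[next_id] = a + b
        next_id += 1
    return merges


def cut_tree(merges, n_docs, height):
    """Return the clusters obtained by cutting the dendrogram at a height."""
    clusters = [{i} for i in range(n_docs)]
    for a, b, h in merges:
        if h > height:
            break  # average-linkage merge heights never decrease
        left = next(c for c in clusters if a[0] in c)
        right = next(c for c in clusters if b[0] in c)
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left | right)
    return sorted(sorted(c) for c in clusters)


# Hypothetical distances: documents 0 and 1 are close, as are 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
merges = average_linkage(dist)
print(cut_tree(merges, 4, height=0.5))  # [[0, 1], [2, 3]]
```

    The sketch recomputes every pairwise average on each pass, which is adequate for a handful of documents but far too slow for collections of thousands.</Paragraph>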
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Rat Genome Database
</SectionTitle>
      <Paragraph position="0"> The Rat Genome Database (RGD) is an NIH (National Institutes of Health) project developed at the Medical College of Wisconsin (MCW) whose main objective is to collect, consolidate, and integrate data generated from rat research (Twigger et al., 2002). The rat is the dominant preclinical model organism used to study human diseases involving the heart, lung, kidney, blood, and vasculature, such as hypertension and renal failure. Researchers at MCW curate approximately 200 articles from 30 journals every month. This is a small portion of the 1200 articles published on rat research every month. The concepts embodied by this document collection are ill-defined. Several conversations with the RGD curators resulted in no clear specification of interests or search terms.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Document Representation
</SectionTitle>
      <Paragraph position="1"> A comparative study of full text (abstracts and titles), MeSH descriptors, MeSH qualifiers, and a combined MeSH descriptors and qualifiers representation was performed. The document collection consists of 2713 papers. The term collection T for the full-text representation contained 17177 unique terms after stemming and stop word removal; for the MeSH representations, T contained 5013 descriptors and 64 qualifiers. After PCA, the numbers of principal components used for the descriptors, qualifiers, combined, and full-text representations were 16, 2, 62, and 37, respectively.</Paragraph>
      <Paragraph position="2"> The clustering quality of each representation was evaluated using 20 bootstrap samples (i.e., sampling with replacement) of size 2713 from the 2713 documents. Each sample was represented and clustered. The resulting agglomerative coefficients were tabulated (Table 1). To test for a significant difference in the agglomerative coefficients obtained between MeSH representations and the full-text representation, the Wilcoxon rank sum test, a non-parametric alternative to the two-sample t-test, was applied.</Paragraph>
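      <Paragraph> The bootstrap step can be sketched in Python as follows. The fixed seed and the use of integer document identifiers are assumptions made so the sketch is reproducible; the function name is hypothetical.

```python
import random


def bootstrap_samples(documents, n_samples=20, seed=0):
    """Draw resamples with replacement, each the size of the collection."""
    rng = random.Random(seed)
    size = len(documents)
    return [rng.choices(documents, k=size) for _ in range(n_samples)]


doc_ids = list(range(2713))  # stand-ins for the 2713 documents
samples = bootstrap_samples(doc_ids)
print(len(samples), len(samples[0]))  # 20 2713
# Sampling with replacement leaves some documents out of each resample.
print(len(set(samples[0])))
```

      Each resample would then be encoded, reduced, and clustered in the same way as the full collection, yielding one agglomerative coefficient per sample and representation.</Paragraph>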
      <Paragraph position="3"> The p-values in Table 2 indicate that each of the MeSH representations is significantly different from the full-text representation. Because larger agglomerative coefficients indicate higher quality clustering, we conclude that MeSH representations offer higher quality clustering than the full-text representation.</Paragraph>
      <Paragraph position="4"> The full-text and combined MeSH representations are explored further. Dendrograms for the full-text representation (Figure 4) and the combined representation (Figure 3) show the structure of the document collection. The combined representation results in two clearly distinct clusters identified at height 1.0. Furthermore, the tree contains several small, tight clusters at low heights, indicating the existence of possible subconcepts. In contrast, the resulting tree for the full-text representation does not reveal the same structure, suggesting that subconcepts are not clearly identified.</Paragraph>
      <Paragraph position="5"> [Dendrogram figure caption, truncated] ...tation, average linking, and cosine distance. The vertical axis represents the intercluster distance, or height, at which the clusters are merged.</Paragraph>
      <Paragraph position="6"> Figures 5 and 6 depict two-dimensional scatterplots of the documents projected onto the first two principal components of the combined representation and the full-text representation, respectively. These plots also show structure: with the combined descriptors and qualifiers representation, there are two distinct clusters with few outliers. The two clusters in the dendrogram of the combined representation correspond to the left and right groups seen in the scatterplot.</Paragraph>
      <Paragraph position="7"> Table 3 presents a summary description of the clusters found for the combined representation. Terms with a weight &gt; 0.5 are included. The summary describes the two major groups of papers: one related to sequence and molecular techniques; the other related to metabolism, biochemical phenomena, and physiology.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Tourette's Syndrome
</SectionTitle>
      <Paragraph position="0"> A second, exploratory study was performed on a document collection about the disease Tourette's Syndrome.</Paragraph>
      <Paragraph position="1"> Only the results of using the combined representation are presented here.</Paragraph>
      <Paragraph position="2"> [Dendrogram figure caption, truncated] The vertical axis represents the intercluster distance, or height, at which the clusters are merged.</Paragraph>
      <Paragraph position="3"> [...] characterized by motor and vocal tics and associated behavioral abnormalities. Chromosomes 2, 7, 11, and 18 have been implicated in causal effects of the disease (OMIM, 2003). The collection was obtained using the query &quot;Tourette's Syndrome&quot; on PubMed, resulting in 2241 papers. The term collection for the combined representation consists of 6524 MeSH descriptors and 76 MeSH qualifiers. Only 8 principal components were required to capture 25% of the variance in the data set. [...] distinct clusters of documents exist at a height of 1.0. The leftmost cluster in the tree could be split again at a height of approximately 0.9. The clusters at lower heights are not as tightly defined as those in the RGD study, indicating more diversity in the document contents.</Paragraph>
      <Paragraph position="4"> Summaries of the three clusters are given in Table 4. In all three clusters, terms associated with Tourette's Syndrome appear with a weight &gt; 0.5 in the cluster center. Documents in the left cluster appear to focus on the psychology and diagnosis associated with the disease, discussing all age groups and genders. The middle cluster consists of papers associated with the genetics and physiopathological diagnosis of Tourette's Syndrome. Of particular interest is the lack of age and gender terms, meaning the papers do not represent consistent themes in ages or genders. Papers associated with drug therapy and pharmacological studies comprise the right cluster, again spanning all age groups and genders. It should be noted that Tourette's Syndrome patients show a therapeutic response to Haloperidol (OMIM, 2003).</Paragraph>
      <Paragraph position="5"> The three identified clusters are represented by 1, 2 (in the bottom center of the plot), and 3 in Figure 8, a scatterplot projected onto the first two principal components. The scatterplot shows a correspondence to the dendrogram: 1's correspond to the left cluster in the tree; 2's to the middle cluster; and 3's to the right cluster.</Paragraph>
      <Paragraph position="7"> [Scatterplot figure caption, truncated] ...ing the combined MeSH representation. The x and y axes are the first two principal components.</Paragraph>
      <Paragraph position="8"> The scatterplot suggests the existence of smaller clusters, which agrees with the hierarchical clustering results.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> Srinivasan has extensively investigated the use of MeSH for classification and text mining (Srinivasan, 2001; Ruiz and Srinivasan, 2002; Srinivasan and Rindflesch, 2002; Ruiz and Srinivasan, 2003; Srinivasan and Wedemeyer, 2003; Srinivasan, to appear). Of particular interest is the work on concept profiles to provide targeted summaries of document collections. In comparison to our work, concept profiles provide a global insight into a document collection, whereas document clustering can provide insight into important groups within a document collection.</Paragraph>
    <Paragraph position="1"> Document clustering of medical literature in full-text representations has been used for functional annotation of gene products (Renner and Aszódi, 2000) and concept discovery (Iliopoulos et al., 2001). In the latter paper, the authors ignore MeSH, arguing that it is not updated or may not capture the document contents. In our study, we found MeSH-indexed documents without abstracts, suggesting that clustering with MeSH terms is complementary work. MeSH descriptors have been considered as additional features in document clustering (Wilbur, 2002), but the hierarchical relationships of MeSH are not used.</Paragraph>
    <Paragraph position="2"> Ontology-based clustering has been considered (Hotho et al., 2001). In this work, terms are selected from the ontology based on frequency, employing the parent-child relationships. Adapting this work to MeSH may be inter-</Paragraph>
    <Paragraph position="4"> [Scatterplot figure caption, truncated] ...ing the full-text representation. The x and y axes are the first two principal components.</Paragraph>
  </Section>
</Paper>