<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3104"> <Title>A Study of Text Categorization for Model Organism Databases</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Material and Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Model Organism Databases </SectionTitle> <Paragraph position="0"> The research reported here is based on MEDLINE references associated with four model organisms (mouse, fly, worm, and yeast), obtained from the literature reference information downloaded from each database in March 2003. All databases provide PMID (a unique identifier for MEDLINE citations) information except WormBase, where some references use MEDLINEID (another unique identifier for MEDLINE citations) and others use PMID as reference identifiers. Moreover, about two thirds of the references in WormBase have no reference identifier pointing to MEDLINE; we eliminated these from our study, since we could not obtain their MEDLINE citation information. [Figure 1. References for the four organism databases from 1966 to 2002. The X-axis represents the year; the Y-axis in (a) represents the number of citations, and the Y-axis in (b) represents each year's proportion of the total number of citations for a given organism.]</Paragraph> <Paragraph position="1"> We then used the e-Fetch tools provided by Entrez (http://entrez.nlm.nih.gov) to retrieve the citation information. Finally, we obtained 31,414 MEDLINE citations from FlyBase, 26,046 from SGD, 3,926 from WormBase, and 48,458 from MGD. Figure 1 shows the distribution of citations by publication date for each organism. 
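The batched retrieval step just described can be sketched against NCBI's current E-utilities eFetch endpoint. This is a present-day interface, not necessarily the exact Entrez tool available in 2003, and the PMIDs below are purely illustrative:

```python
from urllib.parse import urlencode

# Modern E-utilities eFetch endpoint (assumption: the authors' 2003
# e-Fetch workflow maps onto this interface).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmids):
    """Build an eFetch URL that returns MEDLINE-format records
    for a batch of PMIDs in a single request."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "medline",
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)

# Example: request two hypothetical PMIDs in one call.
print(efetch_url([11056683, 11076861]))
```

Fetching the resulting URL (e.g., with urllib.request) returns plain-text MEDLINE records whose TI, AB, and MH fields correspond to the title, abstract, and MeSH Heading features used below.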
Note that 1,005 citations belonged to multiple categories (15 were referenced by mouse, fly, and yeast; 1 by fly, worm, and mouse; 338 by mouse and yeast; 282 by fly and yeast; 310 by fly and mouse; 9 by worm and yeast; 36 by fly and worm; and 5 by mouse and worm). However, compared with the total of 109,844 citations, fewer than 1% of citations carried multiple categories. For simplicity, we therefore defined our task as a single-category text categorization task.</Paragraph> <Paragraph position="2"> Table 2 lists the number of training (Tra) and testing (Te) citations for each year. Note that some fields in certain MEDLINE citations may be empty (e.g., not all references have abstracts); the number of these non-applicable citations for each feature representation is given as well.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Methods </SectionTitle> <Paragraph position="0"> We studied the Taxonomy from NCBI (http://www.ncbi.nlm.nih.gov) and the UMLS knowledge sources (http://umlsks.nlm.nih.gov), derived a list of keywords for each organism, and used the keywords to retrieve relevant articles. If the title, the abstract, or the MeSH Headings of a MEDLINE citation contained any of these keywords, we considered it a relevant article. Table 1 shows the list of keywords we obtained for each model organism.</Paragraph> <Paragraph position="1"> Besides abstracts and titles, MEDLINE citations also contain other information such as Authors, MeSH Headings, and Journals. 
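A minimal sketch of the keyword filter described above, using the keyword lists from Table 1. The dictionary-based citation format and simple substring matching are illustrative assumptions, not the authors' implementation:

```python
# Organism keyword lists, taken from Table 1 of the paper (lowercased).
ORGANISM_KEYWORDS = {
    "mouse": {"mouse", "mice", "mus muscaris", "mus musculus", "mus sp"},
    "yeast": {"saccharomyces", "yeast", "yeasts", "candida robusta",
              "oviformis", "italicus", "capensis", "uvarum", "cerevisiae"},
    "fly":   {"drosophila", "fly", "flies"},
    "worm":  {"elegans", "worm", "worms"},
}

def relevant_organisms(citation):
    """Return the organisms whose keywords appear in the title, abstract,
    or MeSH Headings of a citation (a dict with optional 'title',
    'abstract', and 'mesh' fields). Uses crude substring matching for
    brevity; a real filter would tokenize first."""
    text = " ".join([
        citation.get("title", ""),
        citation.get("abstract", ""),
        " ".join(citation.get("mesh", [])),
    ]).lower()
    return {organism
            for organism, keywords in ORGANISM_KEYWORDS.items()
            if any(kw in text for kw in keywords)}

citation = {"title": "A genetic screen in Drosophila",
            "mesh": ["Animals", "Drosophila melanogaster"]}
print(relevant_organisms(citation))  # {'fly'}
```

A citation matching keywords of more than one organism would yield a multi-element set, which is exactly the multiple-category situation quantified above.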
Based on the intuition that biologists tend to use the same organism throughout their research and that a specific journal tends to publish papers in a limited number of areas, we also evaluated Authors and Journals as features.</Paragraph> <Paragraph position="2"> Additionally, since MeSH Headings, which are assigned manually by librarians to index papers, represent key information about a paper, we also evaluated the categorization power of MeSH Headings in determining which organism a paper belongs to.</Paragraph> <Paragraph position="3"> We then combined some or all of the features together and evaluated the predictive power of the combinations.</Paragraph> <Paragraph position="5"> Table 2 lists, for each year, the details of the training set and the test set under the following feature representations: abstracts (AbT), titles (ArT), authors (Aut), journals (Jou), and MeSH Headings (MH).</Paragraph> <Paragraph position="6"> Table 1. Keywords used to retrieve relevant articles for the four model organisms. MOUSE: mouse, mice, mus muscaris, mus musculus, mus sp. YEAST: Saccharomyces, yeast, yeasts, candida robusta, oviformis, italicus, capensis, uvarum, cerevisiae. FLY: drosophila, fly, flies. WORM: elegans, worm, worms.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Experiments </SectionTitle> <Paragraph position="0"> For each year from 1990 to 2003, we trained a classifier using citations published in all previous years and tested it using citations published in the current year.</Paragraph> <Paragraph position="1"> Table 2 lists the details of the training set and the test set for each year. We experimented with the following feature representations: stemmed words from AbstractText, stemmed words from Title, Author, MeshHeading, and Journals. Since some MEDLINE fields may be empty (for example, some citations do not contain abstracts), Table 2 also provides the number of non-applicable references each year for a given feature representation. From Table 2, we found that every citation has a title. 
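The year-by-year evaluation protocol can be sketched as follows. Only the train-on-all-previous-years / test-on-current-year split follows the paper; the citation record format is an assumption for illustration:

```python
def yearly_splits(citations, first_year=1990, last_year=2003):
    """For each test year, yield (year, train, test): the training set is
    every citation published before that year, and the test set is the
    citations published in that year. Each citation is assumed to be a
    dict carrying at least a 'year' key."""
    for year in range(first_year, last_year + 1):
        train = [c for c in citations if year > c["year"]]
        test = [c for c in citations if c["year"] == year]
        yield year, train, test

# Tiny illustrative corpus (hypothetical years and labels).
corpus = [{"year": 1988, "label": "fly"},
          {"year": 1989, "label": "mouse"},
          {"year": 1990, "label": "yeast"},
          {"year": 1991, "label": "worm"}]

for year, train, test in yearly_splits(corpus, 1990, 1991):
    print(year, len(train), len(test))
# 1990 2 1
# 1991 3 1
```

Because the training set only ever grows forward in time, this scheme avoids training on citations published after the test year, mimicking how a curation pipeline would actually be deployed.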
However, about 6.4% of citations (5,647 out of 88,281) do not have abstracts. For each feature representation, we applied three supervised learning algorithms: Naive Bayes, Decision List learning, and Support Vector Machines.</Paragraph> <Paragraph position="2"> For each combination of machine learning algorithm and feature representation, we computed performance using the F-measure, defined as 2*P*R/(P+R), where the precision P is the ratio of the number of citations predicted correctly to the total number of citations predicted, and the recall R is the ratio of the number of citations predicted correctly to the total number of citations.</Paragraph> <Paragraph position="3"> We then sorted the feature representations by their F-measures and gradually combined them into several complex feature representations. The feature vector of a complex feature representation is formed by simply concatenating the feature vectors of its members. For example, suppose the feature vector of the representation using stemmed words from the title contains an element A, and the feature vector of the representation using stemmed words from the abstract contains an element B; then the feature vector of the complex representation combining the two contains the two elements Title:A and Abstract:B. These complex feature representations were then combined with the machine learning algorithm that had the best overall performance to build text categorization classifiers. As before, we evaluated these complex feature representations by training on citations published in all previous years and testing on citations published in the current year.</Paragraph> </Section> </Section> </Paper>