File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/w04-3104_evalu.xml
Size: 7,671 bytes
Last Modified: 2025-10-06 13:59:20
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3104"> <Title>A Study of Text Categorization for Model Organism Databases</Title> <Section position="5" start_page="0" end_page="3" type="evalu"> <SectionTitle> 4 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 3 shows the detailed F-measures obtained for each combination of machine learning algorithm, year, and feature representation. Among them, support vector machine with stemmed words from abstracts achieved the best F-measure (90.5%). Decision list learning with stemmed words from titles achieved the second best F-measure (90.1%). Feature representation using Mesh Headings, with either decision list learning or support vector machine, achieved the third best F-measure (88.7%). Feature representation using Authors combined with support vector machine achieved an F-measure of 71.8%. Feature representation using Journals had the lowest F-measure (62.1%). From Table 3, we can see that support vector machine performs best for almost every feature representation.</Paragraph> <Paragraph position="1"> Note that the results for the Authors feature representation were significantly worse for year 2002.</Paragraph> <Paragraph position="2"> After reviewing some citations, we found that the format of the author field in MEDLINE citations changed in 2002. The current format results in less ambiguity among authors; however, it means we could not use the author fields of citations from previous years to predict the category of documents for 2002. Also, since many citations from 2002 and 2003 are in-process citations (i.e., their Mesh Heading indexing is still underway), the Mesh Headings feature representation performed worse in these two years compared to other years.</Paragraph> <Paragraph position="3"> Based on the reported performance, we explored the following combined feature representations:</Paragraph> <Paragraph position="4"> i) stemmed words from titles and stemmed words from abstracts; ii) Mesh Headings, stemmed words from titles, and stemmed words from abstracts; iii) Authors, Mesh Headings, stemmed words from titles, and stemmed words from abstracts; and iv) Journals, Authors, Mesh Headings, stemmed words from titles, and stemmed words from abstracts.</Paragraph> <Paragraph position="5"> Figure 2 shows the performance of these feature representations when using support vector machine as the machine learning algorithm. Note that the F-measures for the complex feature representations that contain Abstract, Title, and Mesh Headings are indistinguishable; including additional features such as Authors or Journals does not visibly improve the F-measure. Figure 2 also includes the measure for keyword retrieval, which differs from the measure for the complex feature representations: the performance of keyword retrieval is measured as the ratio of the number of citations for each organism that contain keywords from that organism's keyword list to the total number of citations for the organism, whereas the measure for each complex feature representation is the F-measure obtained using a support vector machine trained on citations from all previous years and tested on citations from the current year.</Paragraph>
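As a rough illustration of this per-field protocol, the sketch below trains a linear SVM on binary stemmed-word features from a single citation field, training on previous years and testing on the current year. The paper does not describe its implementation, so the library choices (scikit-learn, NLTK's Porter stemmer), function names, and data layout here are assumptions, not the authors' method.

```python
# Minimal sketch of the per-field evaluation described above: stemmed words
# from one MEDLINE field as binary features, a linear SVM as the learner,
# trained on citations from all previous years and tested on the current
# year. Library choices and data layout are illustrative assumptions.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stem_tokens(text):
    # The "stemmed words" representation: lowercase and stem each token.
    return [stemmer.stem(tok) for tok in text.lower().split()]

def evaluate_field(train_texts, train_labels, test_texts, test_labels):
    """Train on previous years' citations, test on the current year's."""
    vectorizer = CountVectorizer(tokenizer=stem_tokens, binary=True)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    clf = LinearSVC().fit(X_train, train_labels)
    # Overall (micro-averaged) F-measure across the organism categories.
    return f1_score(test_labels, clf.predict(X_test), average="micro")
```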
<Paragraph position="6"> Figure 2: F-measures of classifiers using complex feature representations learned with support vector machine, and the percentage of citations containing keywords associated with each organism relative to the total number of citations for that organism. The curves for the complex feature representations Abstract+Title+MeshHeading, Abstract+Title+MeshHeading+Author, and Abstract+Title+MeshHeading+Author+Journal overlap.</Paragraph> <Paragraph position="7"> This study at least partially answered our questions. We cannot simply use keywords to retrieve MEDLINE citations for model organism databases: Figure 2 shows that keyword retrieval may miss 20% of the citations. However, when all feature representations are combined, a classifier trained on citations from previous years can correctly predict to which organism current-year citations belong, with an overall F-measure of 94.1%.</Paragraph> <Paragraph position="8"> For the supervised text categorization task, different MEDLINE citation fields differ in their power to predict to which model organism a paper belongs. Feature representation using stemmed words from abstracts has the most stable and highest predictive power, with an overall F-measure of 90.5%. Authors alone can predict the category with an overall F-measure of 71.8%.</Paragraph> <Paragraph position="9"> Among the three supervised machine learning algorithms, support vector machine achieves the best performance. For feature representations in which only a few features in a feature vector have non-zero values, decision list learning achieved performance comparable to (and sometimes better than) support vector machine. For example, decision list learning achieved an F-measure of 90.1% with stemmed words from titles, which is superior to support vector machine (F-measure of 88.5%). Consistent with our findings in (Liu, 2004), the performance of Naive Bayes learning is very unstable. For example, with stemmed words from abstracts, Naive Bayes learning is comparable to the other two algorithms; with Mesh Headings, however, its performance (F-measure of 82.1%) is much worse than that of decision list learning and support vector machine (F-measures above 88.0%).</Paragraph>
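To make the combined representation discussed above concrete (all fields pooled into one feature vector, as in the 94.1% result), one simple way to assemble it is to prefix each token with its field of origin so a single classifier can use all fields at once. The field-prefixing scheme and record format below are illustrative assumptions, not the authors' actual implementation (stemming is omitted for brevity).

```python
# Hypothetical sketch of pooling heterogeneous citation fields into a
# single bag of field-prefixed tokens, so one linear SVM can use titles,
# abstracts, Mesh Headings, Authors, and Journal at once. The record
# layout and prefixing scheme are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def citation_tokens(citation):
    """citation: dict with 'title', 'abstract' (strings), 'mesh',
    'authors' (lists), and 'journal' (string) -- an assumed layout."""
    tokens = []
    for field in ("title", "abstract"):
        tokens += [f"{field}:{w}" for w in citation[field].lower().split()]
    tokens += [f"mesh:{h}" for h in citation["mesh"]]
    tokens += [f"author:{a}" for a in citation["authors"]]
    tokens.append(f"journal:{citation['journal']}")
    return tokens

# CountVectorizer accepts a callable analyzer applied to each raw record.
vectorizer = CountVectorizer(analyzer=citation_tokens, binary=True)
# X = vectorizer.fit_transform(citations); clf = LinearSVC().fit(X, labels)
```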
<Paragraph position="10"> One limitation of the study is that we used only abstracts that are about one of the four model organisms. The evaluation would be more meaningful if we could include abstracts outside these four model organisms. However, such an evaluation would require human experts, since we cannot guarantee that abstracts not included in these four model organism databases are not about one of the four organisms. For the same reason we cannot provide F-measures for keyword retrieval: because the list of references in each organism database is incomplete, we cannot guarantee that abstracts associated with one organism are not also related to another.</Paragraph> <Paragraph position="11"> We could use previously published articles together with their categories to predict the categories of current articles, where the list of categories is not limited to model organisms; it could comprise other categories, such as the main themes of each paragraph in a paper. We will conduct a series of studies on text categorization in the biomedical literature under the condition that category-labeled examples are available. One future project would be to apply text categorization to citation information for protein family classification and annotation in the Protein Information Resource (Wu, 2003).</Paragraph> <Paragraph position="12"> As we know, homologous genes are usually represented in text by the same terms, so knowing to which organism a paper belongs can reduce the ambiguity of biological entity terms. For example, if we know a paper is related to mouse, we can use entities that are specific to mouse for biological entity tagging. Future work will combine text categorization with biological entity tagging to reduce the ambiguity of biological entity names.</Paragraph> </Section> </Paper>