<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2048"> <Title>BioEx: A Novel User-Interface that Accesses Images from Abstract Sentences</Title> <Section position="3" start_page="189" end_page="189" type="metho"> <SectionTitle> 2. Data Collection </SectionTitle> <Paragraph position="0"> We hypothesize that the images reported in a full-text article can be summarized by sentences in its abstract. To test this hypothesis, we randomly selected a total of 329 biological articles recently published in the leading journals Cell (104), EMBO (72), Journal of Biological Chemistry (92), and Proceedings of the National Academy of Sciences (PNAS) (61). For each article, we e-mailed the corresponding author and invited him or her to identify the abstract sentences that summarize image content in that article. To eliminate errors that might be introduced by sentence-boundary ambiguity, we manually segmented the abstracts into sentences and sent the sentences as email attachments.</Paragraph> <Paragraph position="1"> A total of 119 biologists from 19 countries voluntarily participated in the annotation, identifying abstract sentences that summarize figures or tables from 114 articles (39 Cell, 29 EMBO, 30 Journal of Biological Chemistry, and 16 PNAS), or 34.7% of the articles we requested. The responding biologists included the corresponding authors to whom we had sent emails, as well as first authors of the articles to whom the corresponding authors had forwarded our emails. None of the biologists were compensated.</Paragraph> <Paragraph position="2"> This collection of 114 full-text articles incorporates 742 images and 826 abstract sentences.</Paragraph> <Paragraph position="3"> The average number of images per document is 6.5±1.5 and the average number of sentences per abstract is 7.2±1.9. Our data show that 87.9% of the images correspond to abstract sentences and 66.5% of the abstract sentences correspond to images. 
The data empirically validate our hypothesis that image content can be summarized by abstract sentences. Since an abstract is a summary of a full-text article, our results also empirically confirm that images are important elements of full-text articles. This collection of 114 annotated articles was then used as the corpus to evaluate automatic mapping of abstract sentences to images using the natural language processing approaches described in Section 4.</Paragraph> </Section> <Section position="4" start_page="189" end_page="189" type="metho"> <SectionTitle> 3. BioEx User-Interface Evaluation </SectionTitle> <Paragraph position="0"> To evaluate whether biologists would prefer accessing images through abstract-sentence links, we designed BioEx (Figure 1) and two baseline user-interfaces. BioEx is built upon the PubMed user-interface, except that images can be accessed through the abstract sentences.</Paragraph> <Paragraph position="1"> We chose the PubMed user-interface because it receives more than 70 million hits a month and is the user-interface most familiar to biologists. Other information systems have adapted the PubMed user-interface for similar reasons (Smalheiser and Swanson 1998; Hearst 2003). The two baseline user-interfaces were the original PubMed user-interface and a modified version of the SummaryPlus user-interface, in which the images are listed as disjointed thumbnails rather than related by abstract sentences.</Paragraph> <Paragraph position="2"> We asked the 119 biologists who had linked sentences to images in their publications to label each of the three user-interfaces as &quot;My favorite&quot;, &quot;My second favorite&quot;, or &quot;My least favorite&quot;. 
We designed the evaluation so that each user-interface's label is independent of the labels given to the other two user-interfaces.</Paragraph> <Paragraph position="3"> A total of 41 biologists (34.5%) completed the evaluation; 36 of the 41 (87.8%) judged BioEx as &quot;My favorite&quot;. One biologist judged all three user-interfaces to be &quot;My favorite&quot;. Five other biologists considered SummaryPlus &quot;My favorite&quot;, two of whom (4.9% of the 41) judged BioEx to be &quot;My least favorite&quot;.</Paragraph> </Section> <Section position="5" start_page="189" end_page="191" type="metho"> <SectionTitle> 4. Linking Abstract Sentences to Images </SectionTitle> <Paragraph position="0"> We have explored hierarchical clustering algorithms to cluster abstract sentences and image captions based on lexical similarities.</Paragraph> <Paragraph position="1"> Hierarchical clustering algorithms are well established and widely used in many other research areas, including biological sequence alignment (Corpet 1988), gene expression analyses (Herrero et al. 2001), and topic detection (Lee et al. 2006). The algorithm starts with a set of texts (i.e., abstract sentences or image captions), each of which is treated as a document to be clustered. The algorithm computes pair-wise document similarity using the TF*IDF weighted cosine similarity. It then merges the two documents with the highest similarity into one cluster and re-evaluates all pairs of documents/clusters; two clusters can be merged if the average similarity across all pairs of documents within the two clusters exceeds a predefined threshold. 
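The clustering procedure just described can be sketched in Python (a minimal illustration under our own naming; raw term frequency, log IDF, and greedy average-link merging are assumed):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF*IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_link_cluster(docs, threshold):
    """Repeatedly merge the most similar pair of clusters while the
    average pairwise similarity across the two clusters exceeds the
    predefined threshold."""
    vecs = tfidf_vectors(docs)
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vecs[i], vecs[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if avg > best:
                    best, pair = avg, (a, b)
        if best < threshold:  # no pair left above the threshold
            break
        a, b = pair
        clusters[a].extend(clusters.pop(b))
    return clusters
```

When ties or several candidate merges exist, the loop above always takes the pair with the highest average similarity, matching the merge preference described in the text.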
In the presence of multiple clusters that could be merged at any time, the pair of clusters with the highest similarity is always preferred.</Paragraph> <Paragraph position="2"> In our application, if abstract sentences belong to the same cluster as image captions, the abstract sentences are taken to summarize the content of the corresponding images. The clustering model is advantageous over other models in that its flexibility allows &quot;many-to-many&quot; mappings. That is, an abstract sentence can be mapped to zero, one, or more images, and an image can be mapped to zero, one, or more abstract sentences.</Paragraph> <Paragraph position="3"> We explored different learning features, weights, and clustering algorithms to link abstract sentences to images. We applied the TF*IDF weighted cosine similarity for document clustering, treating each sentence or image caption as a &quot;document&quot; with bag-of-words features. We tested three different methods of obtaining the IDF value for each word feature: 1) IDF(abstract+caption): the IDF values were calculated from the pool of abstract sentences and image captions; 2) IDF(full-text): the IDF values were calculated from all sentences in the full-text article; and 3) IDF(abstract)::IDF(caption): two sets of IDF values were obtained. For word features that appear in abstracts, the IDF values were calculated from the abstract sentences; for words that appear in image captions, the IDF values were calculated from the image captions.</Paragraph> <Paragraph position="4"> The positions of abstract sentences and images are also important. The chance that two abstract sentences link to the same image decreases as the distance between them increases. 
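For concreteness, the three IDF schemes described above can be contrasted in a small sketch (the data and names here are hypothetical, not from our corpus):

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF from a pool of tokenized documents: log(N / df(w))."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log(n / df[w]) for w in df}

# Toy tokenized abstract sentences and image captions (hypothetical).
abstract_sents = [["protein", "binds", "dna"], ["binding", "regulates", "dna"]]
captions = [["gel", "shows", "dna", "binding"], ["protein", "gel"]]

# 1) IDF(abstract+caption): one table over the combined pool.
idf_pooled = idf_table(abstract_sents + captions)
# 2) IDF(full-text) would instead pool every sentence of the article.
# 3) IDF(abstract)::IDF(caption): separate tables, one per side.
idf_abs, idf_cap = idf_table(abstract_sents), idf_table(captions)
```

Note how a word like "dna" that occurs in every abstract sentence gets a zero IDF in the abstract-only table yet keeps a nonzero weight in the pooled table; the choice of pool therefore changes which words drive the similarity.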
For example, two consecutive abstract sentences have a higher probability of linking to the same image than two abstract sentences that are far apart.</Paragraph> <Paragraph position="5"> Two consecutive images have a higher chance of linking to the same abstract sentence than two images that are separated by many other images.</Paragraph> <Paragraph position="6"> Additionally, sentence positions in an abstract seem to correspond to image positions. For example, the first sentences in an abstract are more likely than the last sentences to link to the first image.</Paragraph> <Paragraph position="7"> To integrate this &quot;neighboring effect&quot; into our existing hierarchical clustering algorithms, we modified the TF*IDF weighted cosine similarity. If the TF*IDF weighted cosine similarity for a pair of documents i and j is Sim(i,j), the final similarity metric W(i,j) is: W(i,j) = Sim(i,j) * (1 - |Pi/Ti - Pj/Tj|). 1. If i and j are both abstract sentences, Ti=Tj=total number of abstract sentences, and Pi and Pj represent the positions of sentences i and j in the abstract.</Paragraph> <Paragraph position="8"> 2. If i and j are both image captions, Ti=Tj=total number of images that appear in the full-text article, and Pi and Pj represent the positions of images i and j in the full-text article. 3. If i and j are an abstract sentence and an image caption, respectively, Ti=total number of abstract sentences and Tj=total number of images that appear in the full-text article, and Pi and Pj represent the positions of abstract sentence i and image j.</Paragraph> <Paragraph position="9"> Finally, we explored three clustering strategies: per-image, per-abstract-sentence, and mix.</Paragraph> <Paragraph position="10"> The per-image strategy clusters each image caption with all abstract sentences. The image is assigned to the abstract sentence(s) that belong to the same cluster. 
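The position-weighted similarity W(i,j) defined above is a one-line computation (an illustrative sketch; the function name is ours):

```python
def position_weight(sim, p_i, t_i, p_j, t_j):
    """W(i,j) = Sim(i,j) * (1 - |P_i/T_i - P_j/T_j|), where p_* is the
    1-based position of a sentence or image and t_* the corresponding
    total (abstract sentences, or images in the article)."""
    return sim * (1.0 - abs(p_i / t_i - p_j / t_j))
```

Two texts at similar relative positions keep nearly their full lexical similarity, while a first abstract sentence paired with the last image, say, is heavily down-weighted.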
This method weights features in abstract sentences more heavily than those in image captions, because the decision that an image belongs to one or more sentences depends on the features of all abstract sentences and of the examined image caption alone; the features of the other image captions play no role in the clustering.</Paragraph> <Paragraph position="11"> The per-abstract-sentence strategy takes each abstract sentence and clusters it with all image captions that appear in a full-text article. Images are assigned to the sentence if they belong to the same cluster. This method weights features in image captions more heavily than those in abstract sentences, because the decision that an abstract sentence belongs to one or more images depends on the features of the image captions and of the examined abstract sentence. As in per-image clustering, the features of the other abstract sentences play no role in the clustering. The mix strategy clusters all image captions with all abstract sentences, treating features in abstract sentences and image captions equally.</Paragraph> </Section> </Paper>