<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1025"> <Title>Extracting Regulatory Gene Expression Networks from PubMed</Title>
<Section position="4" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> Using our relation extraction rules, we were able to extract 422 relation chunks from our complete corpus. Since one entity chunk can mention several different named entities, these corresponded to a total of 597 extracted pairwise relations. However, as several relation chunks may mention the same pairwise relation, this reduces to 441 unique pairwise relations, comprising 126 up-regulations, 90 down-regulations, and 225 regulations of unknown direction.</Paragraph>
<Paragraph position="1"> Figure 2 displays these 441 relations as a regulatory network in which the nodes represent genes or proteins and the arcs are expression regulation relations. Known transcription factors according to the Saccharomyces Genome Database (SGD) (Dwight et al., 2002) are denoted by black nodes.</Paragraph>
<Paragraph position="2"> From a biological point of view, it is reassuring that these tend to correspond to proteins serving as regulators in our relations.</Paragraph>
<Paragraph position="3"> Figure 2: The extracted relations are shown as a directed graph, in which each node corresponds to a gene or protein and each arc represents a pairwise relation. The arcs point from the regulator to the target, and the type of regulation is specified by the type of arrow head. Known transcription factors are highlighted as black nodes.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation of relation extraction </SectionTitle>
<Paragraph position="0"> To evaluate the accuracy of the extracted relations, we manually inspected all relations extracted from the evaluation corpus using the TIGERSearch visualization tool (Lezius, 2002).</Paragraph>
<Paragraph position="1"> The accuracy of the relations was evaluated at the semantic rather than the grammatical level. Relations were thus counted as correct if they conveyed the correct biological conclusion, even if the linguistic analysis of the sentence left something to be desired. Conversely, a relation was counted as an error if the biological conclusion was wrong.</Paragraph>
<Paragraph position="2"> 75 of the 90 relation chunks (83%) extracted from the evaluation corpus were entirely correct, meaning that the relation corresponded to expression regulation, the regulator (R) and the regulatee (X) were correctly identified, and the direction of regulation (up or down) was correct if extracted.</Paragraph>
<Paragraph position="3"> A further 6 relation chunks extracted the wrong direction of regulation but were otherwise correct; allowing for this minor type of error raises the accuracy to 90%. Approximately half of the errors made by our method stem from overlooked genetic modifications: although the modification is mentioned in the sentence, the extracted relation is not biologically relevant.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Entity recognition </SectionTitle>
<Paragraph position="0"> For the sake of consistency, we have also evaluated our ability to correctly identify named entities at the level of semantic rather than grammatical correctness. Manual inspection of 500 named entities from the evaluation corpus revealed 14 errors, which corresponds to an estimated accuracy of just over 97%. Surprisingly, many of these errors were committed when recognizing proteins, for which our accuracy was only 95%. Phrases such as &quot;telomerase associated protein&quot; (which was confused with the telomerase protein itself) were responsible for about half of these errors.</Paragraph>
<Paragraph position="1"> Among the 153 entities involved in relations, no errors were detected; this is fewer errors than expected given our estimated accuracy on entity recognition (99% confidence according to a hypergeometric test). This suggests that the templates used for relation extraction are unlikely to match those sentence constructs on which the entity recognition goes wrong. False identifications of named entities are thus unlikely to have an impact on the accuracy of relation extraction.</Paragraph>
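As a concrete check of the quoted confidence level, the following minimal Python sketch (ours, not part of the original paper) recomputes the hypergeometric probability of observing zero errors among the 153 relation-involved entities, under the assumption that they behave like a random sample of the 500 manually inspected entities containing 14 errors:

# Minimal sketch, not from the paper: recompute the hypergeometric test above.
# Assumed counts from the evaluation: 500 inspected entities, 14 errors among
# them, and 153 entities involved in extracted relations.
from scipy.stats import hypergeom

M, n, N = 500, 14, 153              # population size, errors in population, sample size
p_zero = hypergeom.pmf(0, M, n, N)  # probability of drawing no errors by chance

print(f"P(0 errors among {N} sampled entities) = {p_zero:.4f}")
# Prints roughly 0.005; since this is below 0.01, it is consistent with the
# 99% confidence level stated in the text.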
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 POS-tagging and tokenization </SectionTitle>
<Paragraph position="0"> We compared the POS-tagging performance of two parameter files on 55,166 tokens from the GENIA corpus that were not used for retraining. Using the retrained tagger, 93.6% of the tokens were correctly tagged, 4.1% carried questionable tags (e.g., confusing proper nouns with common nouns), and 2.3% were clear tagging errors. This compares favourably to the 85.7% correct, 8.5% questionable tags, and 5.8% errors obtained when using the standard English parameter file. Retraining thus reduced the error rate more than two-fold. Of 198 sentences evaluated, the correct sentence boundary was detected in all cases. In addition, three abbreviations incorrectly resulted in sentence markers, corresponding to an overall precision of 98.5% (198 of 201 predicted boundaries).</Paragraph> </Section> </Section> </Paper>