<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3317">
  <Title>Using Dependency Parsing and Probabilistic Inference to Extract Relationships between Genes, Proteins and Malignancies Implicit Among Multiple Biomedical Research Abstracts</Title>
  <Section position="3" start_page="104" end_page="106" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"> For the purpose of running initial experiments with the BioLiterate system, we restricted our attention to texts from the domain of molecular genetics of oncology, mostly selected from the Pub-MEd subset selected for the PennBioNE project (Mandel, 2006). Of course, the BioLiterate architecture in general is not restricted to any particular type or subdomain of texts.</Paragraph>
    <Paragraph position="1"> The system is composed of a series of components arranged in a pipeline: Tokenizer !Gene,  It is worth noting that inference which appear conceptually to be &amp;quot;straightforward deductions&amp;quot; often manifest themselves within BioLiterate as PLN inference chains with 1-2 dozen inferences. This is mostly because of the relatively complex way in which logical relationships emerge from semantic mapping, and also because of the need for inferences that explicitly incorporate &amp;quot;obvious&amp;quot; background knowledge.</Paragraph>
    <Paragraph position="2">  Each component, excluding the semantic mapper and probabilistic reasoner, is realized as a UIMA (Gotz and Suhre, 2004) annotator, with information being accumulated in each document as each phase occurs.</Paragraph>
    <Paragraph position="3">  The gene/protein and malignancy taggers collectively constitute our &amp;quot;entity extraction&amp;quot; subsystem. Our entity extraction subsystem and the tokenizer were adapted from PennBioTagger (McDonald et al, 2005; Jin et al, 2005; Lerman et al, 2006). The tokenizer uses a maximum entropy model trained upon biomedical texts, mostly in the oncology domain. Both the protein and malignancy taggers were built using conditional random fields.</Paragraph>
    <Paragraph position="4"> The nominalization tagger detects nominalizations that represent possible relationships that would otherwise go unnoticed. For instance, in the sentence excerpt &amp;quot;... intracellular signal transduction leading to transcriptional activation...&amp;quot; both &amp;quot;transduction&amp;quot; and &amp;quot;activation&amp;quot; are tagged. The nominalization tagger uses a set of rules based on word morphology and immediate context.</Paragraph>
    <Paragraph position="5"> Before a sentence passes from these early processing stages into the dependency extractor, which carries out syntax parsing, a substitution process is carried out in which its tagged entities are replaced with simple unique identifiers. This way, many text features that often impact parser performance are left out, such as entity names that have numbers or parenthesis as post-modifiers.</Paragraph>
    <Paragraph position="6"> The dependency extractor component carries out dependency grammar parsing via a customized version of the open-source Sleator and Temperley link parser (1993). The link parser outputs several parses, and the dependencies of the best one are taken.</Paragraph>
    <Paragraph position="7">  The relationship extractor component is composed of a number of template matching algorithms that act upon the link parser's output to produce a semantic interpretation of the parse. This component detects implied quantities, normalizes passive and active forms into the same representa- null The semantic mapper will be incorporated into the UIMA framework in a later revision of the software.</Paragraph>
    <Paragraph position="8">  We have experimented with using other techniques for selecting dependencies, such as getting the most frequent ones, but variations in this aspect did not impact our results significantly.</Paragraph>
    <Paragraph position="9"> tion and assigns tense and number to the sentence parts. Another way of conceptualizing this component is as a system that translates link parser dependencies into a graph of semantic primitives (Wierzbicka, 1996), using a natural semantic meta-language (Goddard, 2002).</Paragraph>
    <Paragraph position="10"> Table 1 below shows some of the primitive semantic relationships used, and their associated link parser links:  For a concrete example, suppose we have the sentences: a) Kim kissed Pat.</Paragraph>
    <Paragraph position="11"> b) Pat was kissed by Kim.</Paragraph>
    <Paragraph position="12"> Both would lead to the extracted relationships: subj(kiss, Kim), obj(kiss, Pat) For a more interesting case consider: c) Kim likes to laugh.</Paragraph>
    <Paragraph position="13"> d) Kim likes laughing.</Paragraph>
    <Paragraph position="14"> Both will have a to-do (like, laugh) semantic relation.</Paragraph>
    <Paragraph position="15"> Next, this semantic representation, together with entity information, is feed into the Semantic Mapper component, which applies a series of hand-created rules whose purpose is to transform the output of the Relationship Extractor into logical relationships that are fully abstracted from their syntactic origin and suitable for abstract inference. The need for this additional layer may not be apparent a priori, but arises from the fact that the output of the Relationship Extractor is still in a sense &amp;quot;too close to the syntax.&amp;quot; The rules used within the Relationship Extractor are crisp rules with little context-dependency, and could fairly easily be built into a dependency parser (though the link parser is not architected in such a way as to make this pragmatically feasible); on the other  hand, the rules used in the Semantic Mapper are often dependent upon semantic information about the words being interrelated, and would be more challenging to integrate into the parsing process. As an example, the semantic mapping rule</Paragraph>
    <Paragraph position="17"> maps the relationship by(prevention, inhibition), which is output by the Relationship Extractor, into the relationship subj(prevention, inhibition), which is an abstract conceptual relationship suitable for semantic inference by PLN. It performs this mapping because it has knowledge that &amp;quot;prevention&amp;quot; inherits (Inh) from the semantic category transitive_event, which lets it guess what the appropriate sense of &amp;quot;by&amp;quot; might be.</Paragraph>
    <Paragraph position="18"> Finally, the last stage in the BioLiterate pipeline is probabilistic inference, which is carried out by  (Goertzel et al, in preparation) implemented within the Novamente AI Engine integrated AI architecture (Goertzel and Pennachin, 2005; Looks et al, 2004). PLN is a comprehensive uncertain inference framework that combines probabilistic and heuristic truth value estimation formulas within a knowledge representation framework capable of expressing general logical information, and possesses flexible inference control heuristics including forward-chaining, backward-chaining and reinforcement-learning-guided approaches.</Paragraph>
    <Paragraph position="19"> Among the notable aspects of PLN is its use of two-valued truth values: each PLN statement is tagged with a truth value containing at least two components, one a probability estimate and the other a &amp;quot;weight of evidence&amp;quot; indicating the amount of evidence that the probability estimate is based on. PLN contains a number of different inference rules, each of which maps a premise-set of a certain logical form into a conclusion of a certain logical form, using an associated truth-value formula to map the truth values of the premises into the truth value of the conclusion.</Paragraph>
    <Paragraph position="20"> The PLN component receives the logical relationships output by the semantic mapper, and performs reasoning operations on them, with the aim at arriving at new conclusions implicit in the set of relationships fed to it. Some of these conclusions  Previously named Probabilistic Term Logic may be implicit in a single text fed into the system; others may emerge from the combination of multiple texts.</Paragraph>
    <Paragraph position="21"> In some cases the derivation of useful conclusions from the semantic relationships fed to PLN requires &amp;quot;background knowledge&amp;quot; relationships not contained in the input texts. Some of these background knowledge relationships represent specific biological or medical knowledge, and others represent generic &amp;quot;commonsense knowledge.&amp;quot; The more background knowledge is fed into PLN, the broader the scope of inferences it can draw.</Paragraph>
    <Paragraph position="22"> One of the major unknowns regarding the current approach is how much background knowledge will need to be supplied to the system in order to enable truly impressive performance across the full range of biomedical research abstracts. There are multiple approaches to getting this knowledge into the system, including hand-coding (the approach we have taken in our BioLiterate work so far) and automated extraction of relationships from relevant texts beyond research abstracts, such as databases, ontologies and textbooks. While this is an extremely challenging problem, we feel that due to the relatively delimited nature of the domain, the knowledge engineering issues faced here are far less severe than those confronting projects such as Cyc (Lenat, 1986; Guha, 1990; Guha, 1994) and SUMO (Niles, 2001) which seek to encode commonsense knowledge in a broader, non-domain-specific way.</Paragraph>
  </Section>
  <Section position="4" start_page="106" end_page="108" type="metho">
    <SectionTitle>
3 A Practical Example
</SectionTitle>
    <Paragraph position="0"> We have not yet conducted a rigorous statistical evaluation of the performance of the BioLiterate system. This is part of our research plan, but will involve considerable effort, due to the lack of any existing evaluation corpus for the tasks that BioLiterate performs. For the time being, we have explored BioLiterate's performance anecdotally via observing its behavior on various example &amp;quot;inference problems&amp;quot; implicit in groups of biomedical abstracts. This section presents one such example in moderate detail (full detail being infeasible due to space limitations).</Paragraph>
    <Paragraph position="1"> Table 2 shows two sentences drawn from different PubMed abstracts, and then shows the conclusions that BioLiterate draws from the combination of these two sentences. The table shows the conclusions in natural language format, but the system  actually outputs conclusions in logical relationship form as detailed below.</Paragraph>
    <Paragraph position="2"> Premise 1 Importantly, bone loss was almost completely prevented by p38 MAPK inhibition. (PID 16447221) Premise 2 Thus, our results identify DLC as a novel inhibitor of the p38 pathway and provide a molecular mechanism by which cAMP suppresses p38 activation and promotes apoptosis. (PID  via combining relationships extracted from sentences contained in different PubMed abstracts. The PID shown by each premise sentence is the PubMed ID of the abstract from which it was drawn.</Paragraph>
    <Paragraph position="3"> Tables 3-4 explore this example in more detail. Table 3 shows the relationship extractor output, and then the semantic mapper output, for the two premise sentences.</Paragraph>
    <Paragraph position="4">  premise sentences in the example in Table 2.</Paragraph>
    <Paragraph position="5"> Table 4 shows a detailed &amp;quot;inference trail&amp;quot; constituting part of the reasoning done by PLN to draw the inference &amp;quot;DLC prevents bone loss&amp;quot; from these extracted semantic relationships, invoking background knowledge from its knowledge base as appropriate. null The notation used in Table 4 is so that, for instance, Inh inhib inhib is synonymous with inh(inhib , inhib ) and denotes an Inheritance relationship between the terms inhibition and inhibition (the textual shorthands used in the table are described in the caption). The logical relationships used are Inheritance, Implication, AND (conjunction) and Evaluation. Evaluation is the relation between a predicate and its arguments; e.g. Eval subj(inhib , DLC) means that the subj predicate holds when applied to the list (inhib , DLC). These particular logical relationships are reviewed in more depth in (Goertzel and Pennachin, 2005; Looks et al, 2004). Finally, indent notation is used to denote argument structure, so that e.g.</Paragraph>
    <Paragraph position="6">  means that each of the terms and relationships used as premises, conclusions or intermediaries in PLN inference come along with uncertain truth values. In this case the truth value of the conclusion at the end of Table 4 comes out to &lt;.8,.07&gt;, which indicates that the system guesses the conclusion is true with probability .8, and that its confidence that this probability assessment is roughly correct is .07. Confidence values are scaled between 0 and 1: .07 is a relatively low confidence, which is appropriate given the speculative nature of the inference. Note that this is far higher than the confidence that would be attached to a randomly generated relationship, however.</Paragraph>
    <Paragraph position="7"> The only deep piece of background knowledge utilized by PLN in the course of this inference is the knowledge that:  ) which encodes the transitivity of causation in terms of the subj relationship. The other knowledge  used consisted of simple facts such as the inheritance of inhibition and prevention from the category causal_event.</Paragraph>
    <Paragraph position="8">  up to the conclusion that the prevention act prev1 is carried out by the subject DLC. A shorthand notation is used here: Eval = Evaluation, Imp = Implication, Inh = Inheritance, inhib = inhibition, prev = prevention. For instance, prev  denote terms that are particular instances of the general concept of prevention. Relationships used in premises along the trail, but not produced as conclusions along the trail, were introduced into the trail via the system looking in its knowledge base to obtain the previously computed truth value of a relationship, which was found via prior knowledge or a prior inference trail.</Paragraph>
  </Section>
class="xml-element"></Paper>