<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1025">
  <Title>Extracting Regulatory Gene Expression Networks from PubMed</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> Table 1 shows an overview of the architecture of our IE system. It is organized in levels such that the output of one level is the input of the next one.</Paragraph>
    <Paragraph position="1"> The following sections describe each level in detail.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The corpus
</SectionTitle>
      <Paragraph position="0"> The PubMed resource was downloaded on January 19, 2004. 58,664 abstracts related to the yeast Saccharomyces cerevisiae were extracted by looking for occurrences of the terms &quot;Saccharomyces cerevisiae&quot;, &quot;S. cerevisiae&quot;, &quot;Baker's yeast&quot;, &quot;Brewer's yeast&quot;, and &quot;Budding yeast&quot; in the title/abstract or as head of a MeSH term.</Paragraph>
      <Paragraph position="1"> These abstracts were filtered to obtain the 15,777 that mention at least two names (see section 3.4) and subsequently divided into a training set of 9,137 abstracts and an evaluation set of 6,640 abstracts. Table 1 (levels of the IE system): L0 [...] detected and multiwords are recognized and recomposed to one token.</Paragraph>
      <Paragraph position="2"> L1 POS-tagging: A part-of-speech tag is assigned to each word (or multiword) of the tokenized corpus.</Paragraph>
      <Paragraph position="3"> L2 [...] A manually built taxonomy is used to assign semantic labels to tokens. The taxonomy consists of gene names, cue words relevant for entity recognition, and classes of verbs for relation extraction. L3 Named entity chunking: Based on the POS-tags and the semantic labels, a cascaded chunk grammar recognizes noun chunks relevant for the gene transcription domain, e.g. [...]</Paragraph>
      <Paragraph position="4"> L4 [...] recognized, e.g. &quot;The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1.&quot;</Paragraph>
      <Paragraph position="5"> L5 Output and visualization: Information is gathered from the recognised patterns and transformed into pre-defined records. From the example in L4 we extract that HAP1 regulates the expression of CYC1 and CYC7.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Tokenization and multiword detection
</SectionTitle>
      <Paragraph position="0"> The process of tokenization consists of two steps (Grefenstette and Tapanainen, 1994): segmentation of the input text into a sequence of tokens and detection of sentence boundaries. We use the tokenizer developed by Helmut Schmid at IMS (University of Stuttgart) because it combines high accuracy (99.56% on the Brown corpus) with unsupervised learning, i.e. no manually labelled data is needed (Schmid, 2000).</Paragraph>
      <Paragraph position="1"> The determination of token boundaries in technical or scientific texts is one of the main challenges within information extraction or retrieval. On the one hand, technical terms contain special characters such as brackets, colons, hyphens, slashes, etc. On the other hand, they often appear as multiword expressions, which makes it hard to detect the left and right boundaries of the terms. Although a lot of work has been invested in the detection of technical terms within biology-related texts (see Nenadić et al. (2003) or Yamamoto et al. (2003) for representative results), this task is not yet solved to a satisfactory extent. As we are interested in very specific terms and high-precision results, we opted for multiword detection based on semi-automatic acquisition of multiwords (see sections 3.4 and 3.5).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Part-of-speech tagging
</SectionTitle>
      <Paragraph position="0"> To improve the accuracy of POS-tagging on PubMed abstracts, TreeTagger (Schmid, 1994) was retrained on the GENIA 3.0 corpus (Kim et al., 2003). Furthermore, we expanded the POS-tagger lexicon with entries relevant for our application such as gene names (see section 3.4) and multiwords (see section 3.5). As the tag set, we use the UPenn tag set (Santorini, 1991) plus some minor extensions for distinguishing auxiliary verbs.</Paragraph>
      <Paragraph position="1"> The GENIA 3.0 corpus consists of PubMed abstracts and has 466,179 manually annotated tokens. For our application we made two changes in the annotation. The first one concerns seemingly undecidable cases such as in/or, annotated as in|cc. These were split into three tokens: in, /, and or, each annotated with its own tag. This was done because TreeTagger is not able to assign two POS-tags to one token. The second set of changes adapted the tag set so that vb... is used for forms of to be, vh... for forms of to have, and vv... for all other verbs.</Paragraph>
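The two annotation changes can be illustrated with a minimal Python sketch. It is not the script used to prepare the corpus: the function names, the word lists for to be and to have, and the assumption that every original verb tag starts with vb are made up for illustration.

```python
# Illustrative word lists (assumed, not taken from the paper).
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
HAVE_FORMS = {"have", "has", "had", "having"}


def split_slash_token(token):
    """Split a token such as 'in/or' into ['in', '/', 'or']."""
    if "/" not in token or token == "/":
        return [token]
    parts = []
    for i, piece in enumerate(token.split("/")):
        if i:
            parts.append("/")  # re-insert the slash as its own token
        if piece:
            parts.append(piece)
    return parts


def remap_verb_tag(word, tag):
    """Map a Penn-style verb tag (vb, vbd, vbz, ...) to the vb/vh/vv scheme."""
    tag = tag.lower()
    if not tag.startswith("vb"):
        return tag  # non-verb tags are left unchanged
    suffix = tag[2:]
    lower = word.lower()
    if lower in BE_FORMS:
        return "vb" + suffix
    if lower in HAVE_FORMS:
        return "vh" + suffix
    return "vv" + suffix
```

Under these assumptions, remap_verb_tag("has", "VBZ") yields "vhz", while a verb such as binds is mapped to "vvz". Each token produced by split_slash_token would still need its own tag assigned.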
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Recognizing gene/protein names
</SectionTitle>
      <Paragraph position="0"> To be able to recognize gene/protein names as such, and to associate them with the appropriate database identifiers, a list of synonymous names and identifiers in six eukaryotic model organisms was compiled from several sources (available from http://www.bork.embl.de/synonyms/).</Paragraph>
      <Paragraph position="1"> For S. cerevisiae specifically, 51,640 uniquely resolvable names and identifiers were obtained from Saccharomyces Genome Database (SGD) and SWISS-PROT (Dwight et al., 2002; Boeckmann et al., 2003).</Paragraph>
      <Paragraph position="2"> Before matching these names against the POS-tagged corpus, the list of names was expanded to include different orthographic variants of each name. Firstly, the names were allowed to have various combinations of uppercase and lowercase letters: all uppercase, all lowercase, first letter uppercase, and (for multiword names) first letter of each word uppercase. In each of these versions, we allowed whitespace to be replaced by a hyphen, and a hyphen to be removed or replaced by whitespace. In addition, from each gene name a possible protein name was generated by appending the letter p. The resulting list containing all orthographic variations comprises 516,799 entries.</Paragraph>
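The expansion rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure that produced the 516,799 entries: the function names are hypothetical, the original spelling is kept as a variant, and the letter p is appended after the case and separator variants have been generated (the text does not specify the order).

```python
def case_variants(name):
    """All uppercase, all lowercase, first letter uppercase, and
    (for multiword names) first letter of each word uppercase."""
    variants = {name, name.upper(), name.lower(),
                name[:1].upper() + name[1:].lower()}
    words = name.split(" ")
    if len(words) > 1:
        variants.add(" ".join(w[:1].upper() + w[1:].lower() for w in words))
    return variants


def separator_variants(name):
    """Whitespace may become a hyphen; a hyphen may be removed
    or replaced by whitespace."""
    variants = {name}
    if " " in name:
        variants.add(name.replace(" ", "-"))
    if "-" in name:
        variants.add(name.replace("-", ""))
        variants.add(name.replace("-", " "))
    return variants


def expand_name(name, add_protein_form=True):
    """Return the set of orthographic variants of one gene name."""
    variants = set()
    for cased in case_variants(name):
        variants.update(separator_variants(cased))
    if add_protein_form:
        # Assumed order: append "p" to every variant generated so far.
        variants.update({v + "p" for v in set(variants)})
    return variants
```

For the gene name HAP1 this sketch produces HAP1, hap1, and Hap1 together with the protein forms HAP1p, hap1p, and Hap1p.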
      <Paragraph position="3"> The orthographically expanded name list was fed into the multiword detection and the POS-tagger lexicon, and was subsequently matched against the POS-tagged corpus to retag gene/protein names as such (nnpg). By accepting only matches to words tagged as common nouns (nn), the problem of homonymy was reduced, since e.g. the name MAP can occur as a verb as well.</Paragraph>
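The retagging step can be illustrated with a short Python sketch. The function name and the representation of the corpus as (word, tag) pairs are assumptions made for this example; only the tags nn and nnpg come from the description above.

```python
def retag_gene_names(tagged_tokens, gene_names):
    """Retag a (word, tag) pair as 'nnpg' only if the word is a known
    gene/protein name and the tagger labelled it a common noun ('nn')."""
    return [
        (word, "nnpg") if tag == "nn" and word in gene_names else (word, tag)
        for word, tag in tagged_tokens
    ]


# Usage: the verb reading of "map" is left alone, the noun reading is retagged.
sentence = [("we", "pp"), ("map", "vv"), ("MAP", "nn"), ("expression", "nn")]
retagged = retag_gene_names(sentence, {"MAP", "map"})
```

Restricting the match to tokens tagged nn is what keeps the verb map from being treated as a gene name in this example.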
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Semantic tagging
</SectionTitle>
      <Paragraph position="0"> In addition to recognizing gene and protein names, we recognize several other terms and annotate them with semantic tags. This set of semantically relevant terms consists mainly of nouns and verbs, plus a few prepositions such as from and adjectives such as dependent. The first main set of terms consists of nouns, which are classified as follows: Relevant concepts in our ontology: gene, protein, promoter, binding site, transcription factor, etc. (153 entries).</Paragraph>
      <Paragraph position="1"> Relational nouns, like nouns of activation (e.g. derepression and positive regulation), nouns of repression (e.g. suppression and negative regulation), nouns of regulation (e.g. affect and control) (69 entries).</Paragraph>
      <Paragraph position="3"> Triggering experimental (artificial) contexts: mutation, deletion, fusion, defect, vector, plasmids, etc. (11 entries).</Paragraph>
      <Paragraph position="4"> Enzymes: gyrase, kinase, etc. (569 entries).</Paragraph>
      <Paragraph position="5"> Organism names extracted from the NCBI taxonomy of organisms (Wheeler et al., 2004) (20,746 entries).</Paragraph>
      <Paragraph position="6"> The second set of terms contains 50 verbs and their inflections. They were classified according to their relevance in gene transcription. These verbs are crucial for the extraction of relations between entities: Verbs of activation, e.g. enhance, increase, induce, and positively regulate.</Paragraph>
      <Paragraph position="7"> Verbs of repression e.g. block, decrease, downregulate, and down regulate.</Paragraph>
      <Paragraph position="8"> Verbs of regulation e.g. affect and control.</Paragraph>
      <Paragraph position="9"> Other selected verbs like code (or encode) and contain were given their own tags.</Paragraph>
      <Paragraph position="10"> Each of the terms consisting of more than one word was utilized for multiword recognition.</Paragraph>
      <Paragraph position="11"> We also have two additional classes of words to prevent false positive extractions. The first contains words of negation, like not, cannot, etc. The other contains nouns that must be distinguished from other common nouns so that they are not allowed within named entities, e.g. allele and diploid.</Paragraph>
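As an illustration of how such a lexicon could drive multiword recognition and semantic tagging, consider the following Python sketch. The lexicon holds only a handful of the example terms from this section, and the tag names, the function names, and the greedy longest-match strategy are assumptions rather than a description of the actual system.

```python
# Hypothetical mini-lexicon: terms are examples from the text,
# tag names are invented for this sketch.
SEMANTIC_LEXICON = {
    "transcription factor": "concept",
    "binding site": "concept",
    "enhance": "verb_activation",
    "positively regulate": "verb_activation",
    "block": "verb_repression",
    "down regulate": "verb_repression",
    "affect": "verb_regulation",
    "control": "verb_regulation",
    "not": "negation",
    "cannot": "negation",
}


def merge_multiwords(tokens, lexicon, max_len=3):
    """Greedy longest-match merge of known multiword terms into one token."""
    merged = []
    i = 0
    while i != len(tokens):
        # Try the longest window first; a single token always matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                merged.append(candidate)
                i += n
                break
    return merged


def semantic_tags(tokens, lexicon=SEMANTIC_LEXICON):
    """Return (token, tag) pairs; the tag is None for unlisted tokens."""
    merged = merge_multiwords(tokens, lexicon)
    return [(tok, lexicon.get(tok.lower())) for tok in merged]
```

A real lexicon would also have to list the inflected verb forms, which this sketch omits.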
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Extraction of named entities
</SectionTitle>
      <Paragraph position="0"> In the preceding steps we classified relevant nouns according to semantic criteria. This allows us to chunk noun phrases generalizing over both POS-tags and semantic tags. Syntacto-semantic chunking was performed to recognize named entities using cascades of finite state rules implemented as a [...] Other syntactic variants, for example &quot;the glucokinase gene GLK1&quot;, are recognized too. Similarly, we detect at this early level noun chunks denoting other biological entities such as proteins, activators, repressors, transcription factors etc.</Paragraph>
      <Paragraph position="1"> Subsequently, we recognize more complex noun chunks on the basis of the simpler ones, e.g. promoters, upstream activating/repressing sequences (UAS/URS), binding sites. At this point it becomes important to distinguish between agens and patiens forms of certain entities. Since a binding site is part of a target gene, it can be referred to either by the name of this gene or by the name of the regulator protein that binds to it. It is thus necessary to discriminate between &quot;binding site of&quot; and &quot;binding site for&quot;.</Paragraph>
      <Paragraph position="2"> As already mentioned, we have annotated a class of nouns that trigger experimental context.</Paragraph>
      <Paragraph position="3"> On the basis of these we identify noun chunks mentioning, for example, the deletion, mutation, or overexpression of genes. At a fairly late stage we recognize events that can occur as arguments of verbs, such as &quot;expression of&quot;.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.7 Extraction of relations between entities
</SectionTitle>
      <Paragraph position="0"> This step of processing concerns the recognition of three types of relations between the recognized named entities: up-regulation, down-regulation, and (underspecified) regulation of expression. We combine syntactic properties (subcategorization restrictions) and semantic properties (selectional restrictions) of the relevant verbs to map them to one of the three relation types.</Paragraph>
      <Paragraph position="1"> The following shows a reduced bracketed structure consisting of three parts, a promoter chunk, a verbal complex chunk, and a UAS chunk in patiens: [nx prom the ATR1 promoter region] [contain contains] [nx uas pt [dt a a][bs binding site][for for] [nx activator the GCN4 activator protein]].</Paragraph>
      <Paragraph position="2"> From this we extract that the GCN4 protein activates the expression of the ATR1 gene. We also identify passive constructions, e.g. &quot;RNR1 expression is reduced by CLN1 or CLN2 overexpression&quot;. In this case we extract two pairwise relations, namely that both CLN1 and CLN2 down-regulate the expression of the RNR1 gene. Nominalized relations are identified as well, as exemplified by &quot;the binding of GCN4 protein to the SER1 promoter in vitro&quot;.</Paragraph>
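The mapping from a recognized verb class to pairwise relations can be sketched in Python. This simplified illustration assumes that chunking has already identified the gene names on each side of the verb and whether the construction is passive. The function name, its arguments, and the triple format are hypothetical, and the sketch does not reproduce the finite state rules used in the system.

```python
# The three relation types named in the text, keyed by verb class.
RELATION_BY_VERB_CLASS = {
    "activation": "up-regulation",
    "repression": "down-regulation",
    "regulation": "regulation",
}


def extract_relations(subject_genes, verb_class, object_genes, passive=False):
    """Return one (regulator, relation, target) triple per gene pair.

    In a passive construction the grammatical subject is the target and
    the by-phrase holds the regulators, so the two roles are swapped.
    """
    relation = RELATION_BY_VERB_CLASS[verb_class]
    if passive:
        regulators, targets = object_genes, subject_genes
    else:
        regulators, targets = subject_genes, object_genes
    return [(r, relation, t) for r in regulators for t in targets]


# "RNR1 expression is reduced by CLN1 or CLN2 overexpression"
rnr1 = extract_relations(["RNR1"], "repression", ["CLN1", "CLN2"], passive=True)
```

Applied to the passive example above, the sketch returns two triples, one with CLN1 and one with CLN2 as the regulator that down-regulates RNR1.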
    </Section>
  </Section>
</Paper>