<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1313"> <Title>Technology Corporation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The GENIA Corpus </SectionTitle> <Paragraph position="0"> The GENIA corpus (Ohta et al., 2002) is being developed in the scope of the GENIA project, which seeks to develop information extraction techniques for scientific texts using NLP technology. The corpus consists of semantically annotated published abstracts from the biomedical domain. The corpus is a collection of articles extracted from the on-line MEDLINE abstracts (U.S.</Paragraph> <Paragraph position="1"> National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/, PubMed database).</Paragraph> <Paragraph position="2"> Since the focus of the corpus is on biological reactions concerning transcription factors in human blood cells, articles were selected that contain the MeSH terms human, blood cell and transcription factor.</Paragraph> <Paragraph position="3"> As usual for the field, the articles are composed largely of structurally very complex technical terms, and are almost incomprehensible to a layperson. A typical heading e.g., reads IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.</Paragraph> <Paragraph position="4"> The main value of the GENIA corpus comes from its annotation: all the abstracts and their titles have been marked-up by two domain experts for biologically meaningful terms, and these terms have been semantically annotated with descriptors from the GENIA ontology.</Paragraph> <Paragraph position="5"> The GENIA ontology is a taxonomy of, currently, 47 biologically relevant nominal categories, such as body part, virus, or RNA domain or region; the taxonomy has 35 terminal categories.</Paragraph> <Paragraph position="6"> The terms of the corpus are semantically defined as those sentence constituents that can be categorised using the terminal categories from the ontology. Syntactically such constituents are quite varied: they include qualifiers and can be recursive.</Paragraph> <Paragraph position="7"> The GENIA corpus is encoded in the Genia Project Markup Language. The GPML is an XML DTD (Kim et al., 2001) where each article contains its MEDLINE ID, title and abstract. The texts of the abstracts are segmented into sentences, and these contain the constituents with their semantic classification. The GENIA ontology is provided together with the GENIA corpus and is encoded in DAML+OIL (http://www.daml.org/), the standard XML-based ontology description language. This structure and its annotation will be further discussed below.</Paragraph> <Paragraph position="8"> A suite of supporting tools has been developed or tuned for the GENIA corpus and GPML: the term annotation is performed with the XMLMind editor; an XPath-based concordancer has been developed for searching the corpus; and CSS stylesheets are available for browsing it.</Paragraph> <Paragraph position="9"> At the time of writing, the latest version of the GENIA corpus is 3.01, which has been released in April 2003. It consists of 2,000 abstracts with over 400,000 words and more than 90,000 marked-up terms. This version has not yet been marked-up with tokens or PoS information, although an earlier version (Genia-V3.0p) has been. The GENIA corpus is available free of charge from the GENIA project homepage, at http://www-tsujii.is.s.utokyo.ac.jp/GENIA/. 
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Text Encoding Initiative </SectionTitle> <Paragraph position="0"> The Text Encoding Initiative was established in 1987 as a systematised attempt to develop a fully general text encoding model and a set of encoding conventions based upon it, suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and uses. The TEI Guidelines for Electronic Text Encoding and Interchange were first published in April 1994 in two substantial green volumes, known as TEI P3.</Paragraph> <Paragraph position="1"> In May 1999, a revised edition of TEI P3 was produced, correcting several typographic and other errors. In December 2000 the TEI Consortium (http://www.tei-c.org/) was set up to maintain and develop the TEI standard. In 2002, the Consortium announced the availability of a major revision of TEI P3, TEI P4 (Sperberg-McQueen and Burnard, 2002), the object of which is to provide equal support for XML and SGML applications using the TEI scheme. The revisions needed to make TEI P4 have been deliberately restricted to error correction only, with a view to ensuring that documents conforming to TEI P3 will not become illegal when processed with TEI P4. For GENIA, we are using the XML-compatible version of TEI P4.</Paragraph> <Paragraph position="2"> In producing P4, many possibilities for other, more fundamental changes have been identified.</Paragraph> <Paragraph position="3"> With the establishment of the TEI Council, it became possible to agree on a programme of work to enhance and modify the Guidelines more fundamentally over the coming years. TEI P5 will be the next full revision of the Guidelines. Work on P5 has started; it is likely to appear in 2004, and there are currently several TEI Working Groups addressing various parts of the Guidelines that need attention.</Paragraph> <Paragraph position="4"> More than 80 projects spanning over 30 languages have so far made use of the TEI Guidelines, producing diverse resources, e.g., text-critical editions of classical works. TEI has also been influential in corpus encoding, where the best-known example is probably the British National Corpus. However, while the TEI has been extensively used for annotating PoS-tagged corpora, it has been less popular for encoding texts used by the Information Retrieval/Extraction community; here, a number of other initiatives have taken the lead in encoding, say, ontologies or inter-document linking.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Pros and cons of using TEI </SectionTitle> <Paragraph position="0"> Why, if a corpus is already encoded in XML using a home-grown DTD, re-encode it in TEI at all? One reason is certainly the validation aspect of the exercise: re-coding a corpus, or any other resource, reveals hidden (and in practice incorrect) assumptions about its structure. 
Re-coding to a standard recommendation also forces the corpus designers to face issues which might have been overlooked in the original design.</Paragraph> <Paragraph position="1"> There are also other advantages of using TEI as the interchange format: (1) it is a wide-coverage, well-designed (modular and extensible), widely accepted and well-maintained architecture; (2) it provides extensive documentation, which comprises not only the Guidelines but also papers and documentation (best practices) of various projects; (3) it offers community support via the tei-l public discussion list; (4) various TEI-dedicated software already exists, and more is likely to become available; and (5) using it contributes to the adoption of open standards and recommendations.</Paragraph> <Paragraph position="2"> However, using a very general recommendation which tries to cater for any possible situation also brings with it several disadvantages. Tag abuse: TEI might not have elements/attributes with the exact meaning we require. This results in a tendency to misuse tags for purposes they were not meant for; however, it is a case of individual judgement to decide whether to (slightly) abuse a tag, or to implement a local extension to add the attribute or element required. Tag bloat: Being a general-purpose recommendation, TEI can -- almost by definition -- never be optimal for a specific application. Thus a custom-developed DTD will be leaner, with fewer (redundant) tags and simpler content models. TEI for humanities: While the Guidelines cover a vast range of text types and annotations, they are perhaps least developed for &quot;high level&quot; NLP applications or have failed to keep abreast of &quot;cutting-edge&quot; initiatives. As will be seen, critical areas are the encoding of ontologies, of lexical databases and of feature structures.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Building the TEI DTD </SectionTitle> <Paragraph position="0"> The TEI Guidelines (Sperberg-McQueen and Burnard, 2002) consist of the formal part, which is a set of SGML/XML DTD fragments, and the documentation, which explains the rationale behind the elements available in these fragments, as well as giving overall information about the structure of the TEI.</Paragraph> <Paragraph position="1"> The formal SGML/XML part of TEI comes as a set of DTD fragments or tagsets. A TEI DTD for a particular application is then constructed by selecting an appropriate combination of such tagsets. TEI distinguishes the following types of tagsets: Core tagset : standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder.</Paragraph> <Paragraph position="2"> Base tagsets : basic building blocks for specific text types; exactly one base must be selected by the encoder, unless one of the combined bases is used.</Paragraph> <Paragraph position="3"> Additional tagsets : extra tags useful for particular purposes. All additional tagsets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.</Paragraph> <Paragraph position="4"> User defined tagsets : these extra tags give the possibility of extending and overriding the definitions provided in the TEI tagset. 
Furthermore, they give the option of explicitly including or ignoring (disallowing) each particular element licensed by the chosen base and additional tagsets.</Paragraph> <Paragraph position="5"> While a project-particular XML DTD can be constructed by including and ignoring the TEI DTD fragments directly (as exemplified in Figure 1), it is also possible to build -- for easier processing -- a one-file DTD with the help of the on-line TEI Pizza Chef service, available from the TEI web site.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Parametrising TEI for biomedical corpora </SectionTitle> <Paragraph position="0"> In previous work (Erjavec et al., 2003) we have already proposed a TEI parametrisation of GENIA which was quite broad in its scope. Because a number of tagsets could prove useful in the long term, this parametrisation collected not only those that we considered necessary for the current version of GENIA, but also some that might prove of service in the future. Furthermore, we supported the encoding of both versions 2.1 and 3.0 of the corpus. The resulting DTD was thus very generous in the kinds of data it caters for. To focus the discussion, in the current paper we only address tagsets that are immediately relevant to annotating biomedical texts. In Figure 1 we define the XML DTD that can be used for encoding biomedical resources, and that we used for GENIA V3.01. The XML prolog given in this Figure defines that <teiCorpus.2> is the root element of the corpus, that the external DTD resides at the given URL belonging to the TEI Consortium, and that a number of TEI modules, detailed below, are being used to parametrise the TEI to arrive at our particular DTD.</Paragraph>
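<Paragraph position="1"> A minimal sketch of what such a parametrising prolog looks like in TEI P4 is given below; the DTD location and the names of the two local extension files are placeholders for illustration, not necessarily the values actually used for GENIA:
<!DOCTYPE teiCorpus.2 SYSTEM &quot;http://www.tei-c.org/P4X/DTD/tei2.dtd&quot; [
  <!ENTITY % TEI.XML &quot;INCLUDE&quot;>
  <!ENTITY % TEI.prose &quot;INCLUDE&quot;>
  <!ENTITY % TEI.linking &quot;INCLUDE&quot;>
  <!ENTITY % TEI.analysis &quot;INCLUDE&quot;>
  <!ENTITY % TEI.corpus &quot;INCLUDE&quot;>
  <!ENTITY % TEI.extensions.ent SYSTEM &quot;genia.ent&quot;>
  <!ENTITY % TEI.extensions.dtd SYSTEM &quot;genia.dtd&quot;>
]>
The parameter entities select the modules discussed in the following subsections, while the two extension entities point to the local files described in Sections 4.6 and 4.7.</Paragraph>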
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 TEI.XML </SectionTitle> <Paragraph position="0"> TEI P4 allows both standard SGML and XML encodings. Including the TEI.XML option indicates that the target DTD is to be expressed in XML.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 TEI.prose </SectionTitle> <Paragraph position="0"> The base tagset does not declare many elements but rather inherits all of the TEI core, which includes the TEI header and text elements. A TEI document will typically have <TEI.2> as its root element, which is composed of the <teiHeader>, followed by the <text>; cf. the right-hand side of Figure 2, but note that the root element from the TEI.corpus module is used for the complete corpus.</Paragraph> <Paragraph position="1"> The TEI header describes an encoded work so that the text (corpus) itself, its source, its encoding, and its revisions are all thoroughly documented.</Paragraph> <Paragraph position="2"> TEI.prose also contains elements and attributes for describing text structure, e.g., <div> for text division, <p> for paragraph, <head> for text header, etc. The tagset is therefore useful for encoding the gross structure of the corpus texts; for an illustration again see Figure 2.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 TEI.linking </SectionTitle> <Paragraph position="0"> This additional tagset provides mechanisms for linking, segmentation, and alignment. The elements provided here enable links to be made, e.g., between the articles and their source URLs, or between concepts and their hypernyms.</Paragraph> <Paragraph position="1"> It should be noted that while the TEI treatment of external pointers had been very influential, it was overtaken and made obsolete by newer recommendations. However, the TEI does have a Working Group on Stand-Off Markup, XLink and XPointer, which should produce new TEI encoding recommendations for this area in 2003.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 TEI.analysis </SectionTitle> <Paragraph position="0"> This additional tagset is used for associating simple linguistic analyses and interpretations with text elements. It can be used to annotate words, <w>, clauses, <cl>, and sentences, <s>, with dedicated tags, as well as arbitrary and possibly nested segments with <seg>. Such elements can be, via attributes, associated with their analyses. This tagset has proved very popular for PoS-annotated corpora; for an illustration see Figure 3.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 TEI.corpus </SectionTitle> <Paragraph position="0"> This additional tagset introduces a new root element, <teiCorpus.2>, which comprises a (corpus) header and a series of <TEI.2> elements. The TEI.corpus tagset also extends certain header elements to provide more detailed descriptions of the corpus material.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.6 TEI.extensions.ent </SectionTitle> <Paragraph position="0"> This file gives, for each element sanctioned by the chosen modules, whether we include or ignore it in our parametrisation. While this is not strictly necessary (without any such specification, all the elements would be included), we thought it wise to constrain the content models somewhat, to reduce the bewildering variety of choices that the TEI otherwise offers. Also, such an entity extension file gives the complete list of all the TEI elements that are allowed (and disallowed) in GENIA, which might prove useful for documentation purposes.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.7 TEI.extensions.dtd </SectionTitle> <Paragraph position="0"> This file specifies the changes we have made to TEI elements. We have, e.g., added the url attribute to <xptr> and <xref>, and tagging attributes to the word and punctuation elements.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.8 Conversion of GPML to TEI </SectionTitle> <Paragraph position="0"> Because the source format of GENIA will remain the simpler GPML, it is imperative to have an automatic procedure for converting to the TEI interchange format. The translation process takes advantage of the fact that both the input and output are encoded in XML, which makes it possible to use the XSL Transformation Language (XSLT), which defines a standard declarative specification of transformations between XML documents. There also exist a number of free XSLT processors; we used Daniel Veillard's xsltproc.</Paragraph> <Paragraph position="1"> The transformation is written as an XSLT stylesheet, which makes reference to two documents: the GENIA ontology in TEI and the template for the corpus header. The stylesheet then resolves the GPML-encoded corpus into TEI. The translation of the corpus is thus fully automatic, except for the taxonomy, which was translated by hand.</Paragraph>
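<Paragraph position="2"> A minimal sketch of the kind of template rule such a stylesheet contains is given below; the GPML element and attribute names (sentence, cons, sem) are assumptions for illustration, and the actual stylesheet is of course considerably more involved:
<xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;>
  <!-- GPML sentences become TEI <s> elements -->
  <xsl:template match=&quot;sentence&quot;>
    <s><xsl:apply-templates/></s>
  </xsl:template>
  <!-- GPML semantic constituents become TEI clauses pointing at taxonomy categories -->
  <xsl:template match=&quot;cons&quot;>
    <cl ana=&quot;{@sem}&quot;><xsl:apply-templates/></cl>
  </xsl:template>
</xsl:stylesheet></Paragraph>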
<Paragraph position="3"> Figure 2 illustrates the top-level structure of the corpus, and how it differs between the GPML and TEI encodings. The most noticeable difference is, apart from the renaming of elements, the addition of headers to the corpus and texts. In the GENIA <teiHeader> we give, e.g., the name, address, availability and sampling description, and, for each abstract's <sourceDesc>, two <xptr>s: the first gives the URL of the HTML article in the MEDLINE database, while the second is the URL of the article in the original XML. It should be noted that we use a locally defined url attribute for specifying the value of the pointer.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Characteristics of biomedical texts </SectionTitle> <Paragraph position="0"> In this section we review some challenges that biomedical texts present to the processing and encoding of linguistic information, and the manner of their encoding in our DTD.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Tokens </SectionTitle> <Paragraph position="0"> Tokenisation, i.e., the identification of words and punctuation marks, is the lowest level of linguistic analysis, yet it is, in spite (or because) of this, of considerable importance. As all other levels of linguistic markup make direct or indirect reference to the token stream of the text, errors at this level will propagate to all other annotations.</Paragraph> <Paragraph position="1"> It is also interesting to note that current annotation practice is leaning more and more toward stand-off markup, i.e., annotations that are separated from the primary data (text) and make reference to it only via pointers. However, it is beneficial to have some markup in the primary data to which it is possible to refer, and this markup is, almost exclusively, that of tokens; see, e.g., (Freese et al., 2003).</Paragraph> <Paragraph position="2"> Version V1.1 of GENIA has also been annotated with the LTG tools (Grover et al., 2002). In short, the corpus is tokenised, and then part-of-speech tagged with two taggers, each one using a different tagset, and the nouns and verbs are lemmatised. Additionally, the deverbal nominalisations are assigned their verbal stems.</Paragraph> <Paragraph position="3"> The conversion to TEI is also able to handle this additional markup, by using the TEI.analysis module. The word and punctuation tokens are encoded as <w> and <c> elements respectively, which are further marked with type and lemma and the locally defined c1, c2 and vstem. An example of such markup is given in Figure 3.</Paragraph> <Paragraph position="6"> Given the high density of technical terms, biomedical texts are rife with various types of contractions, such as abbreviations, acronyms, prefixes, etc. As seen already in Figure 3, one of the more problematic aspects of tokenisation is parentheses. Almost all tokenisers (e.g., the LT one, or the UPENN tokeniser) take these as separate tokens, but in biomedical texts many of them are parts of terms. So, out of almost 35,000 distinct terms that have been marked up in the GENIA corpus, over 1,700 contain parentheses. Some examples: (+)-pentazocine, (3H)-E2 binding, (gamma(c))-like molecule.</Paragraph>
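<Paragraph position="7"> Schematically, the token-level markup described above (of the kind Figure 3 illustrates) looks roughly as follows for a simple fragment; the attribute values shown are invented purely for illustration, with c1 and c2 standing for the tags assigned by the two taggers:
<w type=&quot;NN&quot; lemma=&quot;IL-2&quot; c1=&quot;NN&quot; c2=&quot;NNP&quot;>IL-2</w>
<w type=&quot;NN&quot; lemma=&quot;gene&quot; c1=&quot;NN&quot; c2=&quot;NN&quot;>gene</w>
<w type=&quot;NN&quot; lemma=&quot;expression&quot; c1=&quot;NN&quot; c2=&quot;NN&quot;>expression</w>
<c type=&quot;PUN&quot;>.</c></Paragraph>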
<Paragraph position="8"> Correct tokenisation of biomedical texts is thus a challenging task, and it is fair to say that, from a linguistic processing perspective, complex tokenisation is one of the defining characteristics of such corpora.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Terms </SectionTitle> <Paragraph position="0"> Annotation of terms is a prerequisite for meaningful processing of biomedical texts, yet it is often difficult to decide what constitutes a term in a text, and how to abstract away from local variations. Biomedical texts are largely (one could almost say exclusively) composed of terms, and, as mentioned, this brings with it complex abbreviatory mechanisms.</Paragraph> <Paragraph position="1"> Even though TEI offers a <term> element, we chose, in line with the original GPML encoding, to use the TEI.analysis clause (<cl>) element to encode terms instead. In GENIA, the terms have been hand-annotated and marked up with concepts from the GENIA ontology; this was also the defining factor of term-hood, namely that the term could be linked to a terminal concept of the GENIA ontology.</Paragraph> <Paragraph position="2"> In spite of the simple semantic definition, the syntactic structure of the terms in the corpus varies dramatically. Biomedical terms are in some ways similar to named entities (names of people, organizations, etc.), but from the linguistic perspective they are different in that named entities are mostly proper nouns, while terms mostly contain common nouns, and the two differ in their syntactic properties. Terms in the corpus can also be nested, where complex terms are composed out of simpler ones, e.g., <cl><cl>IL-2 gene</cl> transcription</cl>.</Paragraph> <Paragraph position="3"> This nesting, and the reference to ontology concepts, is often far from simple, as (partial) terms can appear in coordinated clauses involving ellipsis. For example, &quot;CD2 and CD25 receptors&quot; refers to two terms, CD2 receptors and CD25 receptors, but only the latter actually appears in the text.</Paragraph> <Paragraph position="4"> In such cases, by parsing the coordination, all the terms can be identified and annotated; the TEI encoding achieves this by specifying the propositional formula involving the participating concepts in the function attribute; for example, <cl function=&quot;(AND G.tissue G.tissue)&quot; ana=&quot;G.tissue&quot;><cl>normal</cl> and <cl>hypopigmented</cl> <cl>skin samples</cl></cl>.</Paragraph> <Paragraph position="5"> The ana attribute encodes the IDREF of the concept; currently, only concepts with the same value are either conjoined or disjoined.</Paragraph> <Paragraph position="6"> The number of <cl> elements in the GENIA corpus is 96,582, among which 89,682 are simple terms and 1,583 are nested terms that contain 3,431 terms. 5,137 terms do not yet have the ana attribute for concept identification, so the total number of ontology-linked terms is 93,293.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Ontologies </SectionTitle> <Paragraph position="0"> One of the more interesting questions in re-coding GENIA in TEI was how to encode the ontology. In GENIA GPML, the ontology is encoded in a separate document, conforming to the DAML+OIL specification. 
This, inter alia, means that this XML file relies heavily on XML Namespaces and the RDF recommendation. An illustrative fragment is given on the left side of Figure 4.</Paragraph> <Paragraph position="1"> Currently the GENIA ontology has a simple tree-like structure, i.e., it corresponds to a taxonomy, so we translated it to the TEI <taxonomy> element, which is contained in the <classDecl> of the header's <encodingDesc>. The TEI defines this element as &quot;[the classification declaration] contains one or more taxonomies defining any classificatory codes used elsewhere in the text&quot;, i.e., it is exactly suited to our purposes.</Paragraph> <Paragraph position="2"> There are quite substantial differences between the two encodings: DAML+OIL models class inclusion with links, while TEI does it as XML element inclusion. This is certainly the simpler and more robust solution, but it requires that the ontology is a taxonomy, i.e., tree-structured. The second difference is in the status of the identifiers: in DAML+OIL they are general #CDATA links, which need separate (XLink/XPointer) mechanisms for their resolution. In TEI they are XML ID attributes, and their resolution can be left to the XML parser.</Paragraph> <Paragraph position="3"> While this is a simpler solution, it supports document-internal references only.</Paragraph>
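<Paragraph position="4"> Schematically, the contrast shown in Figure 4 is between a DAML+OIL class that expresses inclusion with a link, e.g.,
<daml:Class rdf:ID=&quot;natural&quot;> <rdfs:subClassOf rdf:resource=&quot;#source&quot;/> </daml:Class>,
and the TEI taxonomy, which simply nests categories:
<taxonomy id=&quot;G.taxonomy&quot;>
  <category id=&quot;G.source&quot;> <catDesc>source</catDesc>
    <category id=&quot;G.natural&quot;> <catDesc>natural</catDesc> </category>
  </category>
</taxonomy>.
The category identifiers here are those of Figure 4, while the <catDesc> content and the subClassOf rendering are illustrative reconstructions rather than verbatim excerpts.</Paragraph> </Section> </Section> </Paper>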