<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0607">
  <Title>Feeding OWL: Extracting and Representing the Content of Pathology Reports</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Digital Pathology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Application
</SectionTitle>
      <Paragraph position="0"> LUPUS is intended to support the pathologist in two ways. First, it is used to semantically annotate a large archive of case reports, turning them into a valuable resource for diagnosis and teaching. The system uses the case reports produced by experts (the pathologists) to extract information about the accompanying images (of the tissue samples), and thus produces semantic annotation both for the report and for those images.</Paragraph>
      <Paragraph position="1"> This corpus of cases can then be searched in a fast, content-based manner to retrieve case reports (the textual reports together with the images of tissue samples) that might be relevant for a case the pathologist is working on. The search is content-based in that it can make use of semantic relationships between search concepts and those occuring in the text. We also encode in rules knowledge about certain diagnostics tasks, so that for example queries asking for 'differential diagnosis' (&amp;quot;show me cases of diagnoses which are known to be easily confusable with the diagnosis I am thinking of for the present case&amp;quot;) can be processed--tasks which normally require consultation of textbooks. These search capabilities are useful both during diagnosis and for teaching, where it makes interesting examples immediately available to students.</Paragraph>
      <Paragraph position="2"> Another use case is quality control during input of new reports. Using our system, such reports can be entered in a purpose-built editor (which combines digital microscopy facilities (Saeger et al., 2003) with our semantic annotator / search engine), where they are analysed on-the-fly, and potential inconsistencies with respect to the background domain ontology are spotted.1 During the development phase of the system, we are using this feature 1Naturally, to gain acceptance by working pathologists, this process has to be &amp;quot;minimally invasive&amp;quot;.</Paragraph>
      <Paragraph position="3"> to detect where the coverage of the system must be extended.</Paragraph>
      <Paragraph position="4"> The present paper focuses on the process of extracting the relevant information from natural language reports and representing it in a semantic web-ready format as a precondition for performing searches; we leave the description of the search and retrieval functions to another paper. To give an idea of the kind of data we are dealing with, and of the intended target representation, Figure 1 shows an example report (at the top of the figure) and the representation of its content computed by our system (at the bottom).2 We discuss the input format in the following subsection, and the target representation together with the domain knowledge available to us in Subsection 2.3; discussion of the intermediate format that is also shown in the figure is deferred until Section 3.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Pathology Reports
</SectionTitle>
      <Paragraph position="0"> During the development phase of the system, we are using a corpus of 90 randomly selected case reports (ca. 13,000 words; i.e. the average length of the reports is ca. 140 words, with a standard deviation of 12 words) for testing and grammar development. Linguistically, these reports are quite distinguished: they are written in a &amp;quot;telegram&amp;quot;-style, with verbs largely being absent (a rough examination of the corpus showed that only about every 43rd token is a verb, compared to every 11th in a comparable corpus of German newspaper). Also, the vocabulary is rather controlled, with very little variation--this of course is good news for automatically processing such input. On the discourse level we also find a strict structure, with a fixed number of semantically grouped sections. E.g., information about the diagnosis made will normally be found in the section &amp;quot;Kritischer Bericht&amp;quot; (critical report), and the information in the &amp;quot;Makroskopie&amp;quot; and &amp;quot;Mikroskopie&amp;quot; sections (macroscopy and microscopy, respectively) will be about the same parts of the sample, but on different levels of granularity.</Paragraph>
      <Paragraph position="1"> The last peculiarity we note is the relatively high frequency of compound nouns. These are especially important for our task, since technical concepts in German tend to be expressed by such compound nouns (rather than by noun groups). While some 2What is shown in the figure is actually already the result of a preprocessing step; the cases as stored in the database contain patient data as well, and are formatted to comply with the HL7 standard for medical data (The HL7 Consortium, 2003).</Paragraph>
      <Paragraph position="2"> Moreover, the italicisation in the input representation and the numbers in square brackets are added here for ease of reference and are not part of the actual representations maintained by the system.</Paragraph>
      <Paragraph position="3"> of those will denote individual concepts and hence will be recorded in the domain lexicon, others must be analysed and their semantics must be composed out of that of their parts (see below).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Lung Pathology Knowledge in OWL
</SectionTitle>
      <Paragraph position="0"> The result of processing such reports with LUPUS is a representation of (relevant aspects of) their content. This representation has the form of instances of concepts and assertions of properties that are defined in an ontology, which constitutes the domain knowledge of the system (at the moment focussed on pathologies of the lung). This ontology is specified in OWL DL (W3C WebOnt WG, 2004), a version of OWL with a formal semantics and a complete and decidable calculus. Consequently, the content of the texts is represented in OWD DL as well, and so the knowledge base of the system consists of the ontology and the instances.</Paragraph>
      <Paragraph position="1"> The ontology we use is compiled out of several medical sources (such as UMLS (The UMLS Consortium, 2003) and SNOMED (SNOMED International, 2004)), but since these sources often were not intended for machine reasoning (i.e., are not necessarily consistent, and use rather loosely defined relations), considerable effort has been spent (and is being spent) on cleaning them up.3 At the moment, about 1,000 domain-level concepts and ca. 160 upper-level concepts have been identified, which are connected by about 50 core relation types.</Paragraph>
      <Paragraph position="2"> To our knowledge, this makes it one of the biggest OWL-ontologies currently in use.</Paragraph>
      <Paragraph position="3"> Besides representing concepts relevant to our domain, the ontology also lists properties that instances of these concepts can have. These properties are represented as two-place relations; to give an example, the property &amp;quot;green&amp;quot; attributed to an entity x will in our system not be represented as &amp;quot;green(x)&amp;quot;, but rather as something like &amp;quot;colour(x, green)&amp;quot;. This allows us to enforce consistency checks, by demanding that for each second-order predicate (colour, malignity, consistency, etc.) appropriate for a given concept only one value is chosen.4 This choice of representation has consequences for the way the semantics of adjectives is represented in the lexicon, as we will see presently.</Paragraph>
      <Paragraph position="4"> 3There are several current research projects with a similar aim of extracting stricter ontologies from sources like those mentioned above (see e.g. (Schulz and Hahn, 2001; Burgun and Bodenreider, 2001)), and this is by no means a trivial task. The present paper, however, focuses on a different (but of course interdependent) problem, namely that of extracting information such that it can be represented in the way described here. 4Technically, these constraints are realised by functional data-properties relating entities to enumerated data types. An example report (with translation):  Stanzbiopsat [2] eingenommen durch Infiltrate einer soliden malignen epithelialen Neoplasie. [3] Die Tumorzellen mit distinkten Zellgrenzen [4], zum Teil interzellul&amp;quot;ar Spaltr&amp;quot;aume [5], zwischen denen stellenweise kleine Br&amp;quot;ucken [6] nachweisbar sind. Das Zytoplasma leicht basophil, z.T. auch breit und eosinphil, [7] die Zellkerne hochgradig polymorph mit zum Teil multiplen basophilen Nukleolen. [8] Deutliche desmoplastische Stromareaktion. [9]  malignant epithelial neoplasia. The tumor cells with distinct cell borders, partially intercellular spatia, between which sporadically small bridges are verifiable. The cytoplasm lightly basophil, in part also broad and eosinphile, the nuclei highly polymorphic, partially with multiple basophile nucleoli. Distinct desmoplastic stroma reaction.  |Biopsy cylinder from  Using OWL DL as a representation format for natural language content means certain limitations have to be accepted. Being a fragment of FOL, it is not expressive enough to represent certain finer semantic details, as will be discussed below. However, the advantage of using an emerging standard for delivering and sharing information outweighs these drawbacks.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> As mentioned above, most of the sentences in our corpus do not contain a finite verb; i.e., according to standard rules of grammar they are elliptical. While a theoretically motivated approach should strive to resolve this ellipsis contextually (for example as described in (Schlangen, 2003)), in view of the intended application and for reasons of robustness we have decided to focus only on extracting information about the entities introduced in the reports-that is, on recognising nominal phrases, leaving aside the question of how verbal meanings are to be resolved.</Paragraph>
      <Paragraph position="1"> Our strategy is to combine a &amp;quot;shallow&amp;quot; preprocessing stage (based on finite-state methods and statistical approaches) with a symbolic phase, in which the semantics of the NPs is assembled.5 A requirement for the processing is that it must be robust, in two ways: it must be able to deal with unknown tokens (i.e., &amp;quot;out of vocabulary&amp;quot; items) and with unknown structure (i.e., &amp;quot;out of grammar&amp;quot; constructions), degrading gracefully and not just failing. Figure 2 shows a flow chart of the system; the individual modules are described in the following sections.</Paragraph>
      <Paragraph position="2"> 5This strategy sits somewhere between Information Extraction, where also only certain phrases are extracted, for which, however, normally no compositional semantics is computed, and &amp;quot;full&amp;quot; parsing, where such a semantics is computed only if the whole input can be parsed.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Preprocessing
</SectionTitle>
      <Paragraph position="0"> The first step, tokenising and sentence splitting, is fairly standard, and so we skip over it here. The second step, morpho-syntactic analysis, is more interesting. It is performed by an independently developed module called TAGH, a huge finite-state machine that makes use of a German word-stem lexicon (containing about 90,000 entries for nouns, 17,000 for verbs, 20,000 adjectives and adverbs, and about 1,500 closed class word forms). The transducer is implemented in C++ and has a very high throughput (about 20,000 words per second on modern machines). The coverage achieved on a balanced corpus of German is around 96% (Jurish, 2003), for our domain the lexicon had to be extended with some domain specific vocabulary.</Paragraph>
      <Paragraph position="1"> To give an example of the results of the analysis, Figure 3 shows (excerpts of) the output for Sentence 2 of the example report. Note that this is already the POS-disambiguated output, and we only show one analysis for each token. In most cases, we will get several analyses for each token at this stage, differing with respect to their part of speech tag or other morphological features (e.g., case) that are not fully determined by their form. (The average is 5.7 analyses per token.) Note also that the actual output of the module is in an XML format (as indeed are all intermediate representations); only for readability is it presented here as a table.</Paragraph>
      <Paragraph position="2"> Another useful feature of TAGH is that it provides derivational information about compound nouns. To give an example, (1) shows one analysis of the noun &amp;quot;Untersuchungsergebnis&amp;quot; (examination result).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
(1) Untersuchungsergebnis
</SectionTitle>
    <Paragraph position="0"> untersuch(V)[?]ung(n)/s#Ergebnis As this shows, the analysis gives us information about the stems of the compounds; this can be used to guide the computation of the meaning of the complex noun. However, this meaning is not fully com- null positional, as the nature of the relation between the compounds is underspecified. We represent this by use of an underspecified relation rel that holds between the compounds, and which has to be specified later on in the processing chain.</Paragraph>
    <Paragraph position="1"> The output of this module is then fed into a statistically trained POS-disambiguator, which finds the most likely path through the lattice of morphological analyses (Jurish, 2003) (with an accuracy of 96%). In cases where morphology failed to provide an analysis, the syntagmatically most likely POS tag is chosen. At the end of this stage all analyses for a given token agree on its part of speech; however, other features (number, person, case, etc.) might still not be disambiguated.</Paragraph>
    <Paragraph position="2"> At the next stage, certain sequences of tokens are grouped together, namely multi-word expression that denote a single concept in our ontology (e.g., &amp;quot;anthrakotische Lymphknoten&amp;quot; denotes a single concept, and hence is marked as one token of type NN at this step), and certain other phrases (e.g. specifications of spatial dimensions) which can be recognised easily but would require very specialised grammar rules later on.6 Then, the domain-specific lexicon is accessed, which maps &amp;quot;concept names&amp;quot; (nouns, or phrases as recognised in the previous step) to the concept IDs used in the ontology.7 Tokens for which there is no entry in that lexicon, and which are hence deemed 'irrelevant' for the domain, are assigned a 'dummy' semantics appropriate for their part of speech, so that they do not confuse the later parsing stage.</Paragraph>
    <Paragraph position="3"> (More details about this kind of robustness will be given shortly.)</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Chunk Parsing
</SectionTitle>
      <Paragraph position="0"> Next, the analyses of the tokens are transformed into a feature structure format, and are passed to the parsing component.8 The output of this stage is an intermediate semantic representation of (aspects of) the content (of which the notation shown in 1 is a variant). This format is akin to traditional logical forms and still has to be mapped into OWL; we decided on this strategy because such a format is closer to surface structure and hence easier to build compositionally (see discussion below in Section 3.5). Also note that the semantics is &amp;quot;flat&amp;quot;, and does not represent scope of quantifiers (which only very rarely occur in our data, and cannot be represented OWL in any case).</Paragraph>
      <Paragraph position="1"> To get an idea of the feature geometry used by the grammar see Figure 4; this figure also shows the semantic representations generated at this stage (in a different notation than in Figure fig:reps). Note the 'simulation' of typing of feature structures, and the representation of properties via second order properties as discussed above. Chunk parsing is performed by a chart parser running a grammar that is loosely inspired by HPSG (Pollard and Sag, 1994).9 The grammar contains context-free rules for fairly complex NPs (allowing arguments of Ns, modification by PPs, and coordination). When extracting chunks, the strategy followed by the system is to always extract the largest non-overlapping chunks.10 An example might help to illustrate the robust8Up until here, all steps are performed in one go for the whole document. The subsequent steps, on the other hand, are performed incrementally for each sentence. This allows the system to remove ambiguity when it occurs, rather than having to maintain and later filter out different analyses.</Paragraph>
      <Paragraph position="2"> 9The parser is implemented in PROLOG, and based on the simple algorithm given in (Gazdar and Mellish, 1989). It also uses code by Michael Covington for dealing with feature structures in PROLOG, which is described in (Covington, 1994). 10That strategy will prefer lenght of individual chunks over coverage of input, for example when there is one big chunk and two overlapping smaller chunks at each side of that chunk, that however together span more input.</Paragraph>
      <Paragraph position="3">  ness of the system. (2) shows a full syntactic analysis of our example sentence. Our system only recognises the chunks indicated by the brackets printed in bold typeface: since it can't recognise the predicative use of the verb here, it is satisfied with just building parses for the NPs it does recognise. (The round brackets around the analysis of the first word indicate that this parse is strictly speaking not correct if the full structure is respected.) (2) [NP ([NP) [NOM Stanzbiopsat] (]), [ADJP [VVPP2 eingenommen] [PP [P durch] [NP Infiltrate einer soliden malignen epithelialen Neoplasie.]]]]&amp;quot; This is an example of the system's tolerance to unknown structure; (3) shows a (constructed) example of an NP where the structure is covered by the grammar, but there are 'unknown' (or rather, irrelevant) lexical items. As described above, we assign a 'dummy semantics' (here, a property that is true of all entities) to words that are irrelevant to the domain, and so parsing can proceed.</Paragraph>
      <Paragraph position="4">  (3) Solid, hardly detectable tumor cells. solid(x) [?]true(x) [?]tumor cell(x)  A few last remarks about the grammar. First, as shown in Figure 4, NPs without determiner introduce an underspecified relation unspec det, and information about definiteness and number of determiners is represented. This means that all information to do discourse processing (bridging of definites to antecedents) is there; we plan to exploit such information in later incarnations of the system. Secondly, it can of course occur that there is more than one analysis spanning the same input; i.e., we can have syntactic ambiguity. This will be dealt with in the transformation component, where domain knowledge is used to only let through &amp;quot;plausible&amp;quot; analyses.</Paragraph>
      <Paragraph position="5"> Lastly, prepositions are another source for underspecification. For instance, given as input the string (4), the parser will compute a semantics where an underspecified with rel connects the two entities tumor and alveolar; this relation will be specified in the next step, using domain knowledge, to a relation contains.</Paragraph>
      <Paragraph position="6"> (4) Ein Tumor mit freien Alveolaren.</Paragraph>
      <Paragraph position="7"> A tumor with free alveolars.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Resolution of Underspecification using
Ontologies
</SectionTitle>
      <Paragraph position="0"> As described in the previous sections, the output of the parser (and of the morphological analysis) might still contain underspecified relations. These are resolved in the module described in this section. This module sends a query to a reasoning component that can perform inference over the ontology, asking for possible relations that can hold between (instances of) entities. For example (4) above, this will return the answer contains, since the ontology specifies that 'alveolars&amp;quot; are parts of tumours (via a chain of is-a-relations linking tumours with cells, and cells with alveolars). In a similar way the underspecification of compound nouns is resolved. This process proceeds recursively, &amp;quot;inside-out&amp;quot;, since compound nouns can of course be embedded in NPs that are parts of PPs, and so on.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Mapping LF to OWL
</SectionTitle>
      <Paragraph position="0"> In the final step, the logical forms produced by the parser and specified by the previous module are transformed into OWL-compliant representations.</Paragraph>
      <Paragraph position="1"> This process is fairly straightforward, as should be clear from comparing the intermediate representation in Figure 1 with the target representation: a) unique identifiers for the instances of concepts are generated; b) in cases of plural entities (&amp;quot;three samples&amp;quot; - card(x,3) [?]sample(x)), several separate instances are created; and c) appropriateness conditions for properties are applied: if a property is not defined for a certain type of entity, the analysis is rejected.</Paragraph>
      <Paragraph position="2"> This translation step also handles potential syntactic ambiguity, since it can filter out analyses if they specify inconsistent information. Note also that certain information, e.g. about second order properties, might be lost, due to the restricted expressivity of OWL. E.g., an expression like &amp;quot;highly polymorpheous&amp;quot; in Figure 1 either has to be converted into a representation like polymorphism : high, or the modification is lost (polymorpheous(x)).</Paragraph>
      <Paragraph position="3"> This ends our brief description of the system. We now discuss a preliminary evaluation of the modules, related work, and further extensions of the system we are currently working on or which we are planning.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>