<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2018"> <Title>An XML-based document suite</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> <!ELEMENT CONCEPTS (CONCEPT)*> <!ELEMENT CONCEPT (WORD, DESC, SLOTS?)> <!ATTLIST CONCEPT TYPE CDATA #REQUIRED> <!ELEMENT WORD (#PCDATA)> <!ELEMENT DESC (#PCDATA)> <!ELEMENT SLOTS (RELATION+)> <!ELEMENT RELATION (ASSIGN_TO, FORM, CONTENT)> <!ATTLIST RELATION TYPE CDATA #REQUIRED> <!ELEMENT ASSIGN_TO (#PCDATA)> <!ELEMENT FORM (#PCDATA)> <!ELEMENT CONTENT (#PCDATA)> </SectionTitle> <Paragraph position="0"> We use attributes to encode the description of the concepts, and we annotate the relevant relations between concepts through nested tags (e.g. the SLOTS tag).</Paragraph> <Paragraph position="1"> The example above is part of the result of the analysis of the German phrase: Fertigen fester Koerper aus formlosem Stoff durch Schaffen des Zusammenhalts (in English: production of solid objects from formless matter by creating cohesion). The token Fertigen is classified as a process with the relations source, result and instrument. The following phrases (noun phrases and prepositional phrases) are checked to make sure that they are assignable to the relation requirements (semantic and syntactic) of the token Fertigen. Semantic interpretation of the syntactic structure Another step in analyzing the relations between tokens is the interpretation of the syntactic structure of a phrase or sentence. We exploit the syntactic structure of the sublanguage to extract the relations between several tokens. For example, a typical phrase from an autopsy report: Leber dunkelrot (in English: Liver dark red).</Paragraph> <Paragraph position="2"> From semantic tagging we obtain the information shown in Example 8 (results of semantic tagging). In this example we can extract the relation &quot;has-color&quot; between the tokens Leber and dunkelrot. This is an example of a simple semantic relation. Other semantic relations can be described through more complex variations. 
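A concept entry conforming to the DTD above can be processed with any standard XML parser. The following sketch builds an illustrative entry for Fertigen and reads off its relation slots; the element content (descriptions, slot fillers) is invented for demonstration and is not taken from the actual XDOC lexicon:

```python
import xml.etree.ElementTree as ET

# Illustrative CONCEPT entry following the DTD above; the DESC text and
# the RELATION fillers are assumptions made for this example.
ENTRY = """
<CONCEPTS>
  <CONCEPT TYPE="process">
    <WORD>Fertigen</WORD>
    <DESC>production of solid objects</DESC>
    <SLOTS>
      <RELATION TYPE="source">
        <ASSIGN_TO>PP</ASSIGN_TO>
        <FORM>aus + NP</FORM>
        <CONTENT>formless matter</CONTENT>
      </RELATION>
      <RELATION TYPE="instrument">
        <ASSIGN_TO>PP</ASSIGN_TO>
        <FORM>durch + NP</FORM>
        <CONTENT>creating cohesion</CONTENT>
      </RELATION>
    </SLOTS>
  </CONCEPT>
</CONCEPTS>
"""

def relations(doc: str) -> dict:
    """Map each concept word to the TYPEs of its relation slots."""
    root = ET.fromstring(doc)
    out = {}
    for concept in root.iter("CONCEPT"):
        word = concept.findtext("WORD")
        out[word] = [r.get("TYPE") for r in concept.iter("RELATION")]
    return out

print(relations(ENTRY))  # {'Fertigen': ['source', 'instrument']}
```

Because the relation requirements are declared in the SLOTS subtree, a checker can match following noun and prepositional phrases against each RELATION's FORM and ASSIGN_TO constraints, as described in the text.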
In these cases we must consider linguistic structures like modifiers (e.g. etwas, 'somewhat'), negations (e.g. nicht, 'not'), coordinations (e.g. Beckengeruest unversehrt und fest gefuegt, 'pelvic girdle intact and firmly joined') and noun groups (e.g. Bauchteil der grossen Koerperschlagader, 'abdominal part of the aorta'). Current state and future work The XDOC document workbench is currently employed in a number of applications. These include: knowledge acquisition from technical documentation about casting technology; extraction of company profiles from WWW pages; and analysis of autopsy protocols. The latter application is part of a joint project with the institute for forensic medicine of our university. The medical doctors there are interested in tools that help them to exploit their huge collection of several thousand autopsy protocols for their research interests. The confrontation with this corpus has stimulated experiments with 'bootstrapping techniques' for lexicon and ontology creation.</Paragraph> <Paragraph position="3"> The core idea is the following: when you are confronted with a new corpus from a new domain, try to find linguistic structures in the text that are easy to detect automatically and that allow one to classify unknown terms in a robust manner, both syntactically and on the knowledge level.</Paragraph> <Paragraph position="4"> Take the results from a run of these simple but robust heuristics as an initial version of a domain-dependent lexicon and ontology. Exploit these initial resources to extend the processing to more complicated linguistic structures in order to detect and classify more terms of interest automatically. An example: in the sublanguage of (German) autopsy protocols a very telegraphic style is dominant. 
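The bootstrapping loop just described can be sketched as a small heuristic extractor. This is a toy illustration, not the actual XDOC implementation: the regular expression stands in for the paper's easy-to-detect structure (a capitalized German noun, a lower-case adjective, a full stop), and the corpus lines echo the paper's autopsy examples (the last line is invented):

```python
import re
from collections import defaultdict

# Telegraphic findings as they are frequent in German autopsy protocols.
# The first three lines are examples from the paper; the last is invented.
CORPUS = [
    "Harnblase leer.",
    "Harnleiter frei.",
    "Nierenoberflaeche glatt.",
    "Magen gefuellt.",
]

# Heuristic pattern: <Noun> <Adjective> <Fullstop>.
FINDING = re.compile(r"^([A-Z][a-z]+) ([a-z]+)\.$")

def bootstrap(corpus):
    """Derive an initial lexicon and ontology fragment from easy structures."""
    lexicon = {"noun": set(), "adjective": set()}
    ontology = defaultdict(set)  # anatomic entity -> observed attribute values
    for line in corpus:
        m = FINDING.match(line)
        if not m:
            continue  # leave harder structures for later processing passes
        noun, adj = m.groups()
        lexicon["noun"].add(noun)
        lexicon["adjective"].add(adj)
        ontology[noun].add(adj)  # finding: entity has attribute value
    return lexicon, ontology

lexicon, ontology = bootstrap(CORPUS)
print(sorted(lexicon["noun"]))  # ['Harnblase', 'Harnleiter', 'Magen', 'Nierenoberflaeche']
print(ontology["Harnblase"])    # {'leer'}
```

The resulting value sets per entity are what the clustering step below operates on, e.g. grouping entities observed with 'leer'/'gefuellt' separately from those observed with 'glatt'.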
Condensed and compact structures like the following are very frequent: Harnblase leer. ('Urinary bladder empty.')</Paragraph> <Paragraph position="5"> Harnleiter frei. ('Ureters clear.')</Paragraph> <Paragraph position="6"> Nierenoberflaeche glatt. ('Kidney surface smooth.')</Paragraph> <Paragraph position="7"> Vorsteherdruese altersentsprechend. ('Prostate consistent with age.')</Paragraph> <Paragraph position="8"> ...</Paragraph> <Paragraph position="9"> These structures can be abstracted syntactically as ⟨Noun⟩ ⟨Adjective⟩ ⟨Fullstop⟩ and semantically as reporting a finding of the form ⟨Anatomic-entity⟩ has ⟨Attribute-value⟩, and they are easily detectable (Rösner and Kunze, 2002).</Paragraph> <Paragraph position="10"> In our experiments we have exploited this characteristic of the corpus extensively to automatically deduce an initial lexicon (with nouns and adjectives) and ontology (with concepts for anatomic regions or organs and their respective features and values). The feature values were then used to cluster the concept candidates into groups. In this way container-like entities with feature values like 'leer' (empty) or 'gefuellt' (full) can be distinguished from, e.g., entities of surface type with feature values like 'glatt' (smooth).</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Related Work </SectionTitle> <Paragraph position="0"> The work on XDOC has been inspired by a number of precursor projects. In GATE (Site, 2002a; Cunningham and Wilks, 1988), the idea of piping simple modules together to achieve complex functionality was applied to NLP in such a rigid architecture for the first time. The LT XML project pioneered XML as a data format for linguistic processing.</Paragraph> <Paragraph position="1"> Both GATE and LT XML (LTG, 1999) were employed for processing English texts. SMES (Neumann et al., 1997) was an attempt to develop a toolbox for message extraction from German texts. 
A disadvantage of SMES that is avoided in XDOC is the lack of a uniform encoding formalism; in other words, users are confronted with different encodings and formats in each module.</Paragraph> <Paragraph position="2"> System availability Major components of XDOC are made publicly accessible for testing and experiments at the URL: http://lima.cs.uni-magdeburg.de:8000/ Summary We have reported on the current state of the XDOC document suite. This collection of tools for the flexible and robust processing of German documents is based on the use of XML as a unifying formalism for encoding input and output data as well as process information. It is organized in modules with limited responsibilities that can easily be combined into pipelines to solve complex tasks. Strong emphasis is placed on techniques that deal with lexical and conceptual gaps and guarantee robust system behaviour without requiring a priori investment in resource creation by users. When end users first encounter the system, they are typically interested in quick progress in their application and should not be forced to engage in, e.g., lexicon build-up and grammar debugging before being able to start experimenting. This is not to say that the creation of specialized lexicons is unnecessary: there is a strong correlation between prior investment in resources and improved performance and quality of results. Our experience shows that initial experimental results are a good motivation for users' subsequent efforts and investment in extended and improved linguistic resources, but that a priori costs may block users' willingness to get really involved.</Paragraph> </Section> </Section> </Paper>