<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0802"> <Title>WHAT: An XSLT-based Infrastructure for the Integration of Natural Language Processing Components</Title> <Section position="3" start_page="1" end_page="2" type="intro"> <SectionTitle> 3 WHAT: The Whiteboard Annotation Transformer </SectionTitle> <Paragraph position="0"> The main motivation for developing an XSLT-based infrastructure for NLP components was to provide flexible access to standoff XML annotations produced by the components.</Paragraph> <Paragraph position="1"> XSLT (Clark 1999) is a W3C standard language for the transformation of XML documents. The input of an XSL transformation must be XML, while the output can be in any syntax (e.g., XML, text, HTML, RTF, or even programming language source code). The power of XSLT mainly comes from its sublanguage XPath (Clark and DeRose 1999), which supports access to XML structure, elements, attributes and text through concise path expressions. An XSL stylesheet consists of templates with XPath expressions that must match the input document in order to be executed. The order in which templates are called is by default top-down, left to right, but can be modified, augmented, or suppressed through loops, conditionals, and recursive calls of (named) templates.</Paragraph> <Paragraph position="2"> WHAT, the WHiteboard Annotation Transformer, is built on top of a standard XSL transformation engine. 
It provides uniform access to standoff annotation through queries that can be used either by non-XML-aware components to access information stored in the annotation (V and N queries), or to transform (modify, enrich, merge) XML annotation documents (D queries).</Paragraph> <Paragraph position="3"> While WHAT itself is written in a programming language such as Java or C, the XSL query templates that are specific to the standoff DTD of a component's XML output are independent of that programming language, i.e., they need to be written only once for a new component and are collected in a so-called template library.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 WHAT Queries </SectionTitle> <Paragraph position="0"> Based on an input XML document (or DOM object), a WHAT query that consists of * component name, * query name, and * query-specific parameters such as an index or identifier is looked up in the XSLT template library for the specified component; the XSLT stylesheet found there is returned and applied to the XML document by the XSLT processor. The result of stylesheet application is then returned as the answer to the WHAT query. There are basically three kinds of results: * strings (including non-XML output, e.g. RTF or even programming language source code) * lists of unique identifiers denoting references to nodes in the XML input document * XML documents In other words, if we formulate queries as functions, we get the following query signatures: V-query: C × D × P* → S*; N-query: C × D × P* → N*; D-query: C × D × P* → D</Paragraph> <Paragraph position="2"> where C is the component, D an XML document, P* a (possibly empty) sequence of parameters, S* a sequence of strings, and N* a sequence of node identifiers.</Paragraph> <Paragraph position="3"> We now give examples for each of the query types.</Paragraph> <Paragraph position="4"> V-queries return string values from XML attribute values or text. The simplest case is a single XPath lookup. 
As an example, we determine the type of named entity 23 in a shallow XML annotation produced by the SPPC system (Piskorski and Neumann 2000).</Paragraph> <Paragraph position="5"> The WHAT query getValue(&quot;NE.type&quot;, &quot;de.dfki.lt.sppc&quot;, 23) would lead to the lookup of the following query in the XSLT template library for SPPC: &lt;query name=&quot;getValue.NE.type&quot; component=&quot;de.dfki.lt.sppc&quot;&gt; &lt;!-- returns the type of named entity as number --&gt; Given a named entity tag, e.g. &lt;NE id=&quot;23&quot; type=&quot;location&quot;...&gt;, somewhere below the root tag, this query would return the string &quot;location&quot;. By adding a subsequent lookup to a translation table (through XML entity definitions or as part of the input document or of the component-specific template library), it would also be possible to translate names, e.g. in order to map SPPC-annotation-specific names to HPSG type names.</Paragraph> <Paragraph position="6"> We see from this example how WHAT helps to abstract from component-specific DTD structure and naming. However, queries need not be that simple. Complex computations can be performed, and return values can also be numbers, e.g., for queries that count elements, words, etc.</Paragraph> <Paragraph position="7"> An important feature of WHAT is navigation in the annotation. N-queries compute and return lists of node identifiers that can again be used as parameters for subsequent (e.g. V-)queries.</Paragraph> <Paragraph position="8"> The sample query returns the node identifiers of all named entities (NE tags) that are in the given range of tokens (W tags). The template calls a recursive auxiliary template that seeks the next named entity until the end of the range is reached. 
The WHAT query getNodes(&quot;W.NEinRange&quot;, &quot;de.dfki.lt.sppc&quot;, 3, 19) would lead to the lookup of the following query in the XSLT template library for SPPC.</Paragraph> <Paragraph position="9"> &lt;query name=&quot;getNodes.W.NEinRange&quot; compon.=&quot;de.dfki.lt.sppc&quot;&gt; Again, the query forms an abstraction from DTD structure. E.g., in SPPC XML output, named entity elements enclose token elements. This need not be the case for another shallow component; its template would be defined differently, but the query call syntax would be the same.</Paragraph> <Paragraph position="10"> D-queries return transformed XML documents; this is the classical, general use of XSLT. Complex transformations that modify, enrich or produce (standoff) annotation can be used for many purposes. Examples are: * conversion from a different XML format * merging of several XML documents into one * auxiliary document modifications, e.g. to add unique identifiers to elements, sort elements etc. * providing an interface to NLP applications (up to code generation for a programming language compiler...) * visualization and formatting (Thistle, HTML, PDF, ...) * perhaps most important, (linguistic) computation and transformation that turn a WHAT query into a kind of NLP component itself. This is used intensively, e.g., in the shallow topological field parser integration we describe below. Multiple queries are applied in sequence to transform a topological field tree into a list of constraints over syntactic spans that are used to initialize the deep parser's chart. 
One of these WHAT queries has more than 900 lines of XSLT code.</Paragraph> <Paragraph position="11"> We can show only a short example here, a query that inserts unique identifier attributes into an arbitrary XML document without id attributes.</Paragraph> <Paragraph position="12"> Note that this is an example of a stylesheet that is completely independent of a DTD; it works on any XML document and thus shows how generic XSL transformation rules can be.</Paragraph> <Paragraph position="13"> Another example is the transformation of XML tree representations into Thistle trees (arbora DTD; see Calder 2000). While the output DTD is fixed, this is again not true of the input document, which can contain arbitrary element names and branches. Thistle visualizations generated through WHAT are shown in Figs. 4, 5 and 6 below.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Components of the Hybrid System </SectionTitle> <Paragraph position="0"> WHAT has been successfully used in the Whiteboard architecture for online analysis of German newspaper sentences. For more details on motivation and evaluation, cf. Frank et al. (2003) and Becker and Frank (2002). The simplified diagram in Figure 3 depicts the components and the places where WHAT comes into play in the hybrid integration of deep and shallow processing components (V, N, D denote the WHAT query types). The system takes an input sentence, and 2000), which takes PoS-tagged tokens as input, and produces binary tree representations of sentence fields, e.g., topo.bin in Fig. 4. For a justification of binary vs. flat trees, cf. Becker and Frank (2002).</Paragraph> <Paragraph position="1"> The results of the components are three standoff annotations of the input sentence. Then, a sequence of D-queries is applied to flatten the binary topological field trees (result is topo.flat, Fig. 5), merge them with shallow chunk information from Chunkie (topo.chunks, Fig. 
6), and apply the main D-query computing bracket information for the deep parser from the merged topo tree (topo.brackets, Fig. 7).</Paragraph> <Paragraph position="2"> Finally, the deep parser PET (Callmeier 2000), modified as described in Frank et al. (2003), is started with a chart initialized using the shallow bracket information (topo.brackets) through WHAT V and N queries. PET also accesses lexical and named entity information from Again, WHAT abstraction facilitates exchanging the shallow input components of PET without rewriting the parser's code.</Paragraph> <Paragraph position="3"> The dashed lines in Figure 3 indicate that a WHAT-based application can have access to the standoff annotation through D, V or N queries.</Paragraph> <Paragraph position="4"> The Thistle diagrams below are created via D queries out of the intermediate topo.* trees.</Paragraph> <Paragraph position="6"/> </Section> <Section position="3" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.3 Accessing and Transforming Deep Annotation </SectionTitle> <Paragraph position="0"> In the sections so far, we showed examples of shallow XML annotation. But annotation access should not stop short of deep analysis results. In this section, we turn to deep XML annotation. Typed feature structures provide a powerful, universal representation for deep linguistic knowledge.</Paragraph> <Paragraph position="1"> While it is in general inefficient to use XML markup to represent typed feature structures during processing (e.g. for unification, subsumption operations), there are several applications that may benefit from a standardized system-independent XML markup of typed feature structures, e.g., as exchange format for Sailer and Richter (2001) propose an XML markup where the recursive embedding of attribute-value pairs is decomposed into a kind of definite equivalences or non-recursive node lists (triples of node ID, type name and embedded lists of attribute-node pairs). 
The only advantage we see for this kind of representation is its proximity to a particular kind of feature structure implementation.</Paragraph> <Paragraph position="2"> We adopt an SGML markup for typed feature structures originally developed by the Text Encoding Initiative (TEI), which is very compact and seems to be widely accepted, e.g. also in the Tree Adjoining Grammar community (Issac 1998). Langendoen and Simons (1995) give an in-depth justification for the naming and structure of a feature structure DTD. We will focus here on the feature structure DTD subset that is able to encode the basic data structures of deep systems such as LKB (Copestake 1999), PET (Callmeier 2000), PAGE, or the shallow system SProUT (Becker et al. 2002) which have a subset of The FS tag encodes typed Feature Structure nodes; F encodes Features. Atoms are encoded as typed Feature Structure nodes with an empty feature list. Encoding of type hierarchies or other possibly system- or formalism-specific definitions is of course not covered by this minimal DTD.</Paragraph> <Paragraph position="3"> An important point is the encoding of coreferences (reentrancies) between feature structure nodes, which denote structure sharing. For the sake of symmetry in the representation/DTD, we do not declare the coref attribute as ID/IDREF, but as NMTOKEN.</Paragraph> <Paragraph position="4"> An application of WHAT access or transformation of deep annotation would be to specify a feature path under which a value (type, atom, or complex FS) is to be returned. The problem here is the coreferences, which must be dereferenced at every feature level of the path. A general solution is to recursively dereference all nodes in the path.</Paragraph> <Paragraph position="5"> We give only a limited example here, a query to access output of the SProUT system. 
It returns the value (type) of a feature somewhere under the specified attribute in a disjunction of typed feature structures, assuming that we are not interested here in structure sharing between complex values.</Paragraph> <Paragraph position="6"> &lt;query name=&quot;getValue.fs.attr&quot; component=&quot;de.dfki.lt.sprout&quot;&gt; To complete the picture of abstraction through WHAT queries, we can imagine that the same types of query can be used to access, e.g., the same morphology information in both shallow and deep annotation, although its representation within the annotation might be totally different.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.4 Efficiency of XSLT Processors </SectionTitle> <Paragraph position="0"> The processing speed of current XSLT engines on large input documents is a problem. Many XSLT implementations lack efficiency (for an overview cf. xmlbench.sourceforge.net). Although optimization is possible (e.g. through DTD specification, indexing etc.), many implementations do not pursue it seriously.</Paragraph> <Paragraph position="1"> However, there are several WHAT-specific solutions that can help make queries faster. A pragmatic one is pre-editing of large annotation files. An HPSG parser, for example, focuses on one sentence at a time and does not cross sentence boundaries (which can be determined reliably by shallow methods), so it suffices to split shallow XML input into per-sentence annotations in order to reduce processing time to a reasonable amount.</Paragraph> <Paragraph position="2"> Another solution could be packing several independent queries into a 'prepared statement' in one stylesheet. However, as processing speed is mainly determined by the size of the input document, this does not reduce processing time substantially.</Paragraph> <Paragraph position="3"> WHAT implementations may be based on DOM trees or plain XML text input (strings or streams). 
DOM tree representations are used by XSLT implementations such as libxml/libxslt for C/Perl/Python/TCL or Xalan for Java. Hence, DOM implementations of WHAT are preferable: they avoid unnecessary XML parsing when multiple WHAT transformations are processed on the same input and thus help to improve processing speed.</Paragraph> <Paragraph position="4"> As in all programming languages, there are multiple solutions to a problem.</Paragraph> <Paragraph position="5"> An XSL profiling tool (e.g. xsltprofiler.org) can help to locate inefficient XSLT code.</Paragraph> </Section> </Section> </Paper>