<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2716">
  <Title>Layering and Merging Linguistic Annotations</Title>
  <Section position="3" start_page="4" end_page="89" type="metho">
    <SectionTitle>
2 ANC Document Life-Cycle
</SectionTitle>
    <Paragraph position="0"> Documents to be included in the ANC are acquired in many different formats, including MS Word, PDF, HTML, Quark Express, etc. Processing involves a series of steps, which are outlined below.</Paragraph>
    <Section position="1" start_page="4" end_page="89" type="sub_section">
      <SectionTitle>
2.1 Conversion from original format to &quot;rudimentary&quot; XML
</SectionTitle>
      <Paragraph position="0"> The ANC receives documents in a variety of formats. The first step in processing is to convert the input documents into XCES XML with basic structural annotations included. The most common file formats encountered are: * Microsoft Word. The release of OpenOffice 2 has greatly simplified the processing of MS Word documents. OpenOffice uses XSL and XSLT stylesheets to export files to XML and ships with stylesheets to generate DocBook and TEI-compliant formats. We modified the TEI stylesheet to create XCES XML. OpenOffice's Java API enables us to automate and integrate OpenOffice with later processing steps.</Paragraph>
      <Paragraph position="1"> * XML/SGML/HTML. Processing of XML files typically involves using XSLT to map element names to XCES. SGML and HTML files typically require pre-processing to render them into valid XML, followed by the application of an XSLT stylesheet to convert them to XCES.</Paragraph>
      <Paragraph position="2">  * Quark Express. Several publishers provided documents prepared for publication using Quark Express. Quark documents can be exported in XML, but doing so is worthwhile only if the creator of the document takes advantage of Quark's style-definition facilities (which was not the case for any of the contributed Quark documents). We therefore exported the documents in RTF; however, many fonts and special characters are improperly rendered, and fairly extensive manual editing was therefore required to render the files into a format that could be used. Once edited, the same procedures for MS Word documents are used to generate XCES.</Paragraph>
      <Paragraph position="3"> * PDF. Bitmap PDF files are unusable for our purposes. Adobe Acrobat can generate plain text from PDF, although this process loses much of the formatting information that would be desirable to retain to facilitate later processing. In some cases, ligatures and other special characters are improperly represented in the text version, and it is not always possible to automatically detect and convert them to conform to the original. PDF documents with two or more columns cannot, to our knowledge, be extracted without some misordering of the text in the results.</Paragraph>
      <Paragraph position="4"> * Other formats. Other formats in which the ANC has acquired documents include plain text and plain text that employed a variety of proprietary markup languages.</Paragraph>
      <Paragraph position="5"> These documents are processed on a case by case basis, using specialized scripts.</Paragraph>
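      <Paragraph position="6"> The element-name mapping step described for XML input can be sketched in a few lines of Python. This is an illustration only, not the XSLT stylesheets used for the ANC: the source element names and the entries in NAME_MAP are hypothetical, and a real conversion would also restructure content rather than merely rename elements.
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical source-to-target element names; the actual ANC mapping
# is not given in the text.
NAME_MAP = {"section": "div", "h1": "head", "para": "p"}


def rename_elements(xml_text, name_map=NAME_MAP):
    """Return xml_text with element names rewritten according to name_map.

    Elements whose names are not in name_map are left unchanged.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root and all descendants
        elem.tag = name_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rename_elements("<section><h1>Title</h1><para>Text.</para></section>"))
```
]]>
      SGML and HTML input would first have to be made well-formed before a step like this could be applied, as noted above.</Paragraph>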
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
2.2 GATE processing and annotation
</SectionTitle>
      <Paragraph position="0"> We use the University of Sheffield's GATE system for the bulk of ANC document processing and annotation, currently including tokenization, sentence splitting, part-of-speech tagging, noun chunking, and verb chunking. Most annotations are produced using GATE's built-in ANNIE components; we have, however, modified the ANNIE sentence splitter and created several Java plug-ins for use in GATE that perform basic bookkeeping, renaming of annotations/features, moving of annotations between annotation sets, etc. We have also developed a scripting language (http://americannationalcorpus.org/xoro.html) for processing and re-processing of the entire corpus, or for applying selected annotation steps, without having to load the files into a GATE corpus or data store. This eases iterative development as documents are added and tools are refined.</Paragraph>
    </Section>
    <Section position="3" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
2.3 Creation of standoff annotation documents
</SectionTitle>
      <Paragraph position="0"> We have developed several custom processing resources that plug into GATE to generate standoff annotations in the XCES implementation of the LAF format. The last step in our GATE pipeline is to create the primary text document and generate all the required standoff annotation files.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="89" end_page="89" type="metho">
    <SectionTitle>
3 Standoff Format
</SectionTitle>
    <Paragraph position="0"> The ANC standoff format for annotations is a simple graph representation, consisting of one node set and one or more edge sets. The node set is represented by the text itself, with an implied node between each pair of adjacent characters. Each edge set is represented by an XML document and may contain one or more annotation types: logical structure, sentence boundaries, tokens, etc.</Paragraph>
    <Paragraph position="1"> An ANC header file for each document is used to associate the source text with the standoff annotation documents; for example:  graph; values of the from and to attributes denote the nodes (between characters in the primary text document) over which the edge spans.</Paragraph>
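    <Paragraph position="2"> The graph model can be illustrated with a minimal Python sketch. It is not the ANC implementation: the Edge class, the sample text, and the annotation type names are assumptions made for illustration, and nodes are assumed to include the positions before the first and after the last character. Because node i lies immediately before character i, an edge from node a to node b covers the characters a through b-1.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One annotation: an edge spanning nodes `start`..`end` of the primary text."""
    type: str
    start: int  # value of the `from` attribute (`from` is a Python keyword)
    end: int    # value of the `to` attribute


def node_count(text):
    # One node before each character plus one after the last character.
    return len(text) + 1


def covered_text(text, edge):
    # Node i sits just before character i, so the edge covers text[start:end].
    return text[edge.start:edge.end]


# Hypothetical primary text and edge set (one sentence, two tokens).
TEXT = "Dogs bark."
EDGES = [Edge("s", 0, 10), Edge("tok", 0, 4), Edge("tok", 5, 9)]

if __name__ == "__main__":
    for edge in EDGES:
        print(edge.type, repr(covered_text(TEXT, edge)))
```
]]>
    Since the primary text is never modified, any number of edge sets of this kind can be layered over the same node set.</Paragraph>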
    <Section position="1" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.1 Annotating discontiguous spans
</SectionTitle>
      <Paragraph position="0"> Presently, the ANC includes standoff annotations that reference contiguous spans of data in the original (primary) document. However, we plan to add a wide variety of automatically-produced annotations for various linguistic phenomena to the ANC data, some of which will reference discontiguous regions of the primary data, or may reference annotations contained in other standoff documents. This is handled as follows: given an annotation graph, G, we create an edge graph G' whose nodes can themselves be annotated, thereby allowing for edges between the edges of the original annotation graph G.</Paragraph>
      <Paragraph position="1"> For example, consider the sentence &quot;My dog has fleas.&quot; The standoff annotations for the tokens would be:</Paragraph>
      <Paragraph position="3"> Now consider the dependency tree generated by Minipar given in Figure 2. The tree can be represented by annotating the token elements in the standoff annotation document as follows: &lt;!-- Define some pseudo nodes --&gt; &lt;node type=&quot;root&quot; id=&quot;E0&quot; ref=&quot;t3&quot;/&gt; &lt;node type=&quot;clone&quot; id=&quot;E2&quot; ref=&quot;t2&quot;/&gt; &lt;!-- Define edges in dependency tree --&gt;</Paragraph>
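      <Paragraph position="4"> The idea of edges between edges can be sketched in Python as follows. The token ids t1 to t4, their offsets, and the dependency labels are assumptions made for illustration; they are not the Minipar output of Figure 2. Only the two pseudo nodes E0 and E2 are taken from the listing above.
<![CDATA[
```python
TEXT = "My dog has fleas."

# Token edges of the original graph G: id -> (from, to). Ids are assumed.
TOKENS = {"t1": (0, 2), "t2": (3, 6), "t3": (7, 10), "t4": (11, 16)}

# Pseudo nodes of the edge graph G': id -> (type, referenced token edge).
NODES = {"E0": ("root", "t3"), "E2": ("clone", "t2")}

# Illustrative edges between token edges: (head, dependent, label).
# The labels are hypothetical, not taken from Figure 2.
DEPS = [("t3", "t2", "subj"), ("t3", "t4", "obj"), ("t2", "t1", "gen")]


def surface(ref):
    """Resolve a token id or pseudo-node id to the text it ultimately covers."""
    if ref in NODES:
        ref = NODES[ref][1]  # follow the pseudo node to its token edge
    start, end = TOKENS[ref]
    return TEXT[start:end]


def describe(deps):
    """Render each dependency as 'head -label-> dependent'."""
    return [f"{surface(head)} -{label}-> {surface(dep)}" for head, dep, label in deps]


if __name__ == "__main__":
    print("root:", surface("E0"))
    for line in describe(DEPS):
        print(line)
```
]]>
      Because the dependency edges refer to token ids rather than character offsets, the same mechanism can relate regions that are not adjacent in the primary text.</Paragraph>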
    </Section>
  </Section>
  <Section position="5" start_page="89" end_page="89" type="metho">
    <SectionTitle>
4 Creating In-line Annotation Documents
</SectionTitle>
    <Paragraph position="0"> We have developed an &quot;XCES Parser&quot; that implements the org.xml.sax.XMLReader interface to create ANC documents containing in-line annotations in XML (or any other format). The XCES parser works as follows: annotations to be loaded are selected with the org.xml.sax.XMLReader.setProperty() method. The selected annotation sets are then loaded into a single list in memory and sorted, first by offset and, if the offsets are the same, by annotation type. At present, the ordering of annotation types is hard-coded into the parser; work is underway to make the XCES parser &quot;schema aware&quot; so that embedding specifications can be provided by the user. Once the text is loaded and sorted, the appropriate SAX2 events are generated and dispatched to the org.xml.sax.ContentHandler (if one has been registered with the parser) in sequence to simulate the parsing of an XML document. While the parser will allow the programmer to specify an ErrorHandler, DTDHandler, or EntityResolver, at this time no methods from those interfaces will be invoked during parsing.</Paragraph>
    <Paragraph position="1"> In the current version of the XCES parser, when overlapping annotations are encountered, they are &quot;truncated&quot;. For example:  Work is underway to provide the option to generate milestones in CLIX/HORSE (DeRose, 2004) format to represent overlapping hierarchies.</Paragraph>
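    <Paragraph position="2"> The sort, truncate, and dispatch steps can be sketched with Python's xml.sax module standing in for the Java SAX2 interfaces. This is a sketch under stated assumptions, not the XCES parser itself: the TYPE_ORDER table, the root element name "text", and the reading of "truncated" as cutting an annotation at the end of its enclosing element are all assumptions. Annotation offsets are assumed to lie within the text.
<![CDATA[
```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Hypothetical hard-coded nesting order: lower rank opens first at equal offsets.
TYPE_ORDER = {"s": 0, "q": 1, "tok": 2}


def emit_inline(text, annotations, handler):
    """Dispatch SAX events for `text` with (start, end, type) annotations in-line."""
    ordered = sorted(
        annotations, key=lambda a: (a[0], TYPE_ORDER.get(a[2], len(TYPE_ORDER)))
    )
    open_stack = []  # (end, type) of currently open elements, innermost last
    pos = 0          # offset of the next character not yet emitted

    def close_until(offset):
        nonlocal pos
        # Close every open element that ends at or before `offset`.
        while open_stack and open_stack[-1][0] <= offset:
            end, name = open_stack.pop()
            handler.characters(text[pos:end])
            handler.endElement(name)
            pos = end

    handler.startDocument()
    handler.startElement("text", AttributesImpl({}))
    for start, end, name in ordered:
        close_until(start)
        handler.characters(text[pos:start])
        pos = start
        if open_stack:
            # Assumed truncation rule: cut at the end of the enclosing element.
            end = min(end, open_stack[-1][0])
        handler.startElement(name, AttributesImpl({}))
        open_stack.append((end, name))
    close_until(len(text))
    handler.characters(text[pos:])
    handler.endElement("text")
    handler.endDocument()


def to_inline_xml(text, annotations):
    """Serialize the event stream to a string using the standard XMLGenerator."""
    out = io.StringIO()
    emit_inline(text, annotations, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()


if __name__ == "__main__":
    # "q" (4..13) overlaps "s" (0..7) and is truncated to 4..7.
    print(to_inline_xml("one two three", [(4, 13, "q"), (0, 7, "s")]))
```
]]>
    Any object implementing the ContentHandler methods can be passed as the handler, so the same event stream can drive serializers for formats other than XML.</Paragraph>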
    <Section position="1" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
4.1 Using the XCES parser
</SectionTitle>
      <Paragraph position="0"> The XCES parser can be used in three ways: * from the command line. The xcesparser.jar file can be run as a command line program to print XML with inline annotation to standard output.</Paragraph>
      <Paragraph position="1"> * as the XML parser used by other applications. For example, Saxon can take the name of the parser to use to parse the source document as a command line parameter. This allows us to apply XSLT stylesheets to ANC documents without having to translate them into XML first. * as a library for use in other Java applications. For example, the ANC Tool is a graphical front end to the XCES parser.</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
4.2 The ANC tool
</SectionTitle>
      <Paragraph position="0"> The ANC Tool provides a graphical user interface for the XCES parser and is used to convert ANC documents to other formats. Users specify their choice of annotations to be included. Currently, the ANC Tool can be used to generate the following output formats: * XML XCES format, suitable for use with the BNC's XAIRA (http://sourceforge.net/projects/xaira) search and access interface; * Text with part of speech tags appended to each word and separated by an underscore; * WordSmith/MonoConc Pro format.</Paragraph>
      <Paragraph position="1"> The ANC Tool uses multiple implementations of the org.xml.sax.DocumentHandler interface, one for each output format, which the XCES parser uses to generate the desired output. Additional output formats can be easily generated by adding further implementations of the interface.</Paragraph>
      <Paragraph position="2"> Of course, if the target application understands annotation graphs, there is no need to bother with the XCES parser or conversion to XML. For example, we provide several resources for GATE that permit GATE to open and read ANC documents with standoff annotations, or to load standoff annotations into an already loaded document.</Paragraph>
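      <Paragraph position="3"> The one-handler-per-format design can be illustrated with a Python handler that produces the word-and-tag output format listed above. This is a sketch, not the ANC Tool's code: it uses Python's xml.sax.ContentHandler in place of the Java interface, and the element name "tok" and attribute name "pos" are assumptions about the in-line markup.
<![CDATA[
```python
import xml.sax


class WordPosHandler(xml.sax.ContentHandler):
    """Collects each token as word_TAG from in-line token elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pos = None   # tag of the token being read, or None outside tokens
        self._chars = []

    def startElement(self, name, attrs):
        if name == "tok":  # assumed token element name
            self._pos = attrs.get("pos", "UNK")  # assumed attribute name
            self._chars = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._pos is not None:
            self._chars.append(content)

    def endElement(self, name):
        if name == "tok":
            self.words.append("".join(self._chars) + "_" + self._pos)
            self._pos = None


def to_word_pos(inline_xml):
    """Parse in-line XML and return the tokens joined as word_TAG strings."""
    handler = WordPosHandler()
    xml.sax.parseString(inline_xml.encode("utf-8"), handler)
    return " ".join(handler.words)


if __name__ == "__main__":
    sample = '<s><tok pos="PRP$">My</tok> <tok pos="NN">dog</tok></s>'
    print(to_word_pos(sample))
```
]]>
      A handler for another output format would override the same three methods and differ only in what it accumulates and writes.</Paragraph>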
    </Section>
  </Section>
  <Section position="6" start_page="89" end_page="89" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> Currently the XCES parser is a proof of concept rather than a production-grade tool. The parser is being augmented to invoke all the appropriate methods from the org.xml.sax.*Handler interfaces and throw the proper SAXExceptions at the appropriate times. We are also providing for some level of SAX conformance, rather than simply &quot;doing what Xerces does&quot;.</Paragraph>
  </Section>
</Paper>