<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1021">
  <Title>DESCRIPTION OF THE LTG SYSTEM USED FOR MUC-7</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
OVERVIEW
</SectionTitle>
    <Paragraph position="0"> The basic building blocks in our MUC system are reusable text handling tools which we have been developing and using for a number of years at the Language Technology Group. They are modular tools with stream input/output; each tool does a very specific job, but can be combined with other tools in a unix pipeline. Different combinations of the same tools can thus be used in a pipeline for completing different tasks.</Paragraph>
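    <Paragraph> To make the pipeline idea concrete, a minimal sketch of such a composition might look as follows; the tool names here are illustrative only, not the actual LTG tool inventory:

      cat story.sgml | tokenise | postag | countwords

In this hypothetical pipeline, a tokeniser adds word-level markup to the stream, a tagger layers part-of-speech information on top of it, and a counting utility reads the enriched markup to gather statistics.</Paragraph>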
    <Paragraph position="1"> Our architecture imposes an additional constraint on the input/output streams: they should have a common syntactic format. For this common format we chose eXtensible Markup Language (XML). XML is an official, simplified version of Standard Generalised Markup Language (SGML), simplified to make processing easier [3]. We were involved in the development of the XML standard, building on our expertise in the design of our own Normalised SGML (NSL) and NSL tool LT NSL [10], and our XML tool LT XML [11]. A detailed comparison of this SGML-oriented architecture with more traditional database-oriented architectures can be found in [9].</Paragraph>
    <Paragraph position="2"> A tool in our architecture is thus a piece of software which uses an API for all its access to XML and SGML data and performs a particular task: exploiting markup which has previously been added by other tools, removing markup, or adding new markup to the stream(s) without destroying the previously added markup. This approach allows us to remain entirely within the SGML paradigm for corpus markup while allowing us to be very general in the design of our tools, each of which can be used for many purposes. Furthermore, because we can pipe data through processes, the unix operating system itself provides the natural "glue" for integrating data-level applications.</Paragraph>
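    <Paragraph> As a schematic illustration of this non-destructive layering (our own example; the element names are invented, not the actual LTG tag set), a tokenisation tool might transform

      <P>Flights resumed on Monday.</P>

into

      <P><W>Flights</W> <W>resumed</W> <W>on</W> <W>Monday</W><W>.</W></P>

adding W elements for the word tokens while leaving the pre-existing P markup untouched for later tools in the pipeline.</Paragraph>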
    <Paragraph position="3"> The SGML-handling API in our workbench is our LT NSL library [10], which can handle even the most complex document structures (DTDs). It allows a tool to read, change or add attribute values and character data to SGML elements, and to address a particular element in an NSL or XML stream using a query language called ltquery.</Paragraph>
    <Paragraph position="4"> The simplest way of configuring a tool is to specify in a query where the tool should apply its processing. The structure of an SGML text can be seen as a tree, as illustrated in Figure 1. Elements in such a tree can be addressed in a way similar to unix file system pathnames. For instance, DOC/TEXT/P[0] will give all first paragraphs under TEXT elements which are under DOC. We can address an element by freely combining partial descriptions, e.g. its location in the tree, its attributes, character data in the element and sub-elements contained in the element. The queries can also contain wildcards. For instance, the query .*/S will give all sentences anywhere in the document, at any level of embedding.</Paragraph>
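    <Paragraph> To make the addressing concrete, consider the skeletal document below (our own simplified sketch of the kind of structure shown in Figure 1):

      <DOC>
        <DATE>...</DATE>
        <TEXT>
          <P><S>First sentence.</S></P>
          <P><S>Second sentence.</S></P>
        </TEXT>
      </DOC>

Here the query DOC/TEXT/P[0] selects only the first P element under TEXT, whereas the wildcard query .*/S selects both S elements, whatever their depth of embedding.</Paragraph>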
    <Paragraph position="5"> Using the syntax of ltquery we can directly specify which parts of the stream we want to process and which parts we want to skip, and we can tailor tool-specific resources for this kind of targeted processing. For example, we have a programme called fsgmatch which can be used to tokenize input text according to rules specified in certain resource grammars. It can be called with different resource grammars for different document parts. Here is an example pipeline using fsgmatch:</Paragraph>
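    <Paragraph position="6"> A sketch of such a pipeline follows, reconstructed from the description in the next paragraph; the option syntax and the disjunctive query notation are assumptions for illustration, not the documented fsgmatch interface:

      cat input.sgml                           | \
        fsgmatch -q ".*/(DATE|NWORDS)" date.gr | \
        fsgmatch -q ".*/PREAMBLE" preamb.gr    | \
        fsgmatch -q "DOC/TEXT/P[0]" P0.gr</Paragraph>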
    <Paragraph position="7"> In this pipeline, fsgmatch takes the input text, and processes the material that has been marked up as DATE or NWORDS using a tokenisation grammar called date.gr; then it processes the material in PREAMBLE using the tokenisation grammar preamb.gr; and then it processes the first paragraph in the TEXT section using the grammar P0.gr.</Paragraph>
    <Paragraph position="8"> This technique allows one to tailor resource grammars very precisely to particular parts of the text. For example, the reason for applying P0.gr to the first sentence of a news wire is that that sentence often contains unusual information which occurs nowhere else in the article and which is very useful for the MUC task: in particular, if the sentence starts with capitalised words followed by &MD;, the capitalised words indicate a location, e.g. PASADENA, Calif. &MD;.</Paragraph>
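    <Paragraph> As an illustration (our own schematic example; the sentence text is invented, though the ENAMEX element and its TYPE attribute follow standard MUC named-entity markup), a grammar targeted at such datelines could produce:

      <ENAMEX TYPE="LOCATION">PASADENA, Calif.</ENAMEX> &MD; NASA scientists said on Monday ...

capturing the location from the capitalised dateline prefix before the &MD; separator.</Paragraph>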
    <Paragraph position="9"> We have used our tools in different language engineering tasks, such as information extraction in a medical domain [4], statistical text categorisation [2], collocation extraction for lexicography [1], etc. The tools include text annotation tools (a tokeniser, a lemmatiser, a tagger, etc.) as well as tools for gathering statistics and general purpose utilities. Combinations of these tools provide us with the means to explore corpora and to do fast prototyping of text processing applications. A detailed description of the tools, their interactions and application can be found in [4] and [5]; information can also be found at our website, http://www.ltg.ed.ac.uk/software/. This tool infrastructure was the starting point for our MUC campaign.</Paragraph>
  </Section>
</Paper>